(57) ABSTRACT

A method and apparatus for clocking an integrated circuit. 
The apparatus includes an integrated circuit having a clock 
driver disposed in a first side of a semiconductor substrate, 
and a clock distribution network coupled to the clock driver 
and disposed in a second side of the semiconductor substrate 
to send a clock signal to clock an area of the integrated 
circuit.

METHOD FOR DISTRIBUTING A CLOCK
ON THE SILICON BACKSIDE OF AN
INTEGRATED CIRCUIT

This application is a division of Ser. No. 08/938,486, filed
Sep. 30, 1997, now U.S. Pat. No. 6,037,822.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The invention relates generally to integrated circuit clocking
and, more specifically, to clock signal distribution in
integrated circuits.

2. Description of Related Art 

An issue facing the integrated circuit industry today is the 
problem of distributing clock signals throughout an inte- 
grated circuit die with low clock skew. Clock skew is the 
difference in arrival times of clock edges to different parts of 
the chip. Synchronous digital logic requires precise clocks 
for latching data. Ideal synchronous logic relies on clocks 
arriving simultaneously to all the circuits. Clock skew 
reduces the maximum frequency of the circuit as the circuit 
must be designed for the worst case skew to operate reliably. 
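
The relationship between skew and maximum frequency described
above can be illustrated with a minimal sketch. The arrival times
and the logic delay below are hypothetical values chosen only for
illustration, and the model (clock period equals logic delay plus
worst case skew) is a simplifying assumption, not a statement of
how any particular design is timed.

```python
def clock_skew(arrival_times_ps):
    """Skew: latest minus earliest arrival of the same clock edge (ps)."""
    return max(arrival_times_ps) - min(arrival_times_ps)


def max_frequency_ghz(logic_delay_ps, skew_ps):
    """Assumed worst-case rule: the period must cover logic delay plus skew."""
    period_ps = logic_delay_ps + skew_ps
    return 1000.0 / period_ps  # a period in ps converts to GHz as 1000 / ps


# Hypothetical arrival times (ps) of one clock edge at four chip regions.
arrivals = [100.0, 130.0, 250.0, 180.0]
skew = clock_skew(arrivals)

print(skew)                              # 150.0
print(max_frequency_ghz(1850.0, 0.0))    # frequency with an ideal clock
print(max_frequency_ghz(1850.0, skew))   # 0.5
```

With these assumed numbers, 150 ps of skew lowers the usable clock
rate from roughly 0.54 GHz to 0.5 GHz.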

The challenge facing integrated circuit designers is to
ensure that the clock switches at exactly the same time
throughout the chip so that each circuit is kept in step to 
avoid delays that can cause chip failure. In prior art global 
clock distribution networks, clock skew caused by signal 
routing is typically controlled by the use of hierarchical 
H-trees. FIG. 1 is a diagram illustrating such a hierarchical 
H-tree clock distribution network 101 that is implemented in 
high-speed integrated circuits to reduce the clock skew 
effect. As shown in FIG. 1, a clock driver 103 is used to drive 
H-tree network 101 at center node 105. It is appreciated that 
clock driver 103 is typically a very large driver in order to 
provide sufficient drive to H-tree network 101, which typi- 
cally has a large capacitance in complex, high-speed inte- 
grated circuits as will be described below. As observed in 
FIG. 1, the clock paths of the "H" formed between nodes 
107, 109, 111, and 113 have equal length between center 
node 105 and each of the peripheral points of the "H" at 
nodes 107, 109, 111, and 113, respectively. Therefore,
assuming a uniform propagation delay of a clock signal per 
unit length of the H-tree network 101, there should be no 
clock skew between the clock signal supplied to nodes 107, 
109, 111, and 113 from clock driver 103. 

FIG. 1 further illustrates H-tree network 101 taken to 
another hierarchical level with an "H" coupled to each 
respective peripheral node of the first level "H". 
Accordingly, every peripheral node 115 is an equal distance 
from node 107. Every peripheral node 117 is an equal 
distance from node 109. Every peripheral node 119 is an 
equal distance from node 111. Finally, every peripheral node 
121 is an equal distance from 113. Thus, the clock paths 
from all nodes labeled 115, 117, 119, and 121 are an equal 
distance from clock driver 103 and therefore should have no 
clock skew between them (assuming a uniform propagation 
delay) since the clock delay from clock driver 103 should be 
equal at all peripheral nodes of the H-tree network 101. 
Thus, each node 115, 117, 119, and 121 can be configured to 
act as a receiving station for a clock signal and service the 
clocking requirements of an area of the integrated circuit 
near the node with negligible clock skew with reference to 
other similarly configured nodes of the H-tree network. 
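
The equal-path-length property of the hierarchical H-tree can be
checked with a short sketch. The recursive construction below is
an illustrative geometric model only; the function name, the
coordinates, and the choice to halve the "H" size at each level
are assumptions, not dimensions taken from FIG. 1.

```python
def h_tree_leaves(x, y, half_size, levels, path_len=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an H-tree.

    Each "H" routes from its center along the crossbar (half_size) and
    then along a leg (half_size) to reach each of its four tips.
    """
    if levels == 0:
        return [(x, y, path_len)]
    leaves = []
    for dx in (-half_size, half_size):
        for dy in (-half_size, half_size):
            leaves += h_tree_leaves(
                x + dx, y + dy, half_size / 2, levels - 1,
                path_len + 2 * half_size,
            )
    return leaves


# Two hierarchical levels, as in the two-level tree described above.
leaves = h_tree_leaves(0.0, 0.0, half_size=4.0, levels=2)
lengths = {length for _, _, length in leaves}

print(len(leaves))   # 16
print(lengths)       # {12.0}
```

All sixteen peripheral nodes sit at the same wire length from the
center, so under a uniform delay per unit length they would see
the clock edge at the same time.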

As described, the H-tree propagation delay of a clock
signal per unit length of the network may be controlled by
placing every peripheral node an equal distance from clock
driver 103. However, the propagation delay of a clock signal
because of length or distance of the paths traveled by the
signal is only one ingredient that leads to skew. Another
equally important ingredient is the consistency of speed of
the signal as it traverses the path. One component that affects
the speed of this signal is the resistance of the metal. Metal
layers, such as aluminum (Al), have an inherent resistivity
that is a property of the metal, but the actual resistance a
signal encounters can be affected by the thickness of the
metal layer, because resistance is inversely proportional to
layer thickness. In general, however, clock metal layer
thickness is approximately 1.5 microns (μm), making the
resistance of the metal fairly consistent or predictable in
most integrated circuits.

The consistency of the speed of the signal in prior art
clock distribution networks also depends generally on the
impedance the signal encounters as it travels from the clock
driver to the receiving station or clocked circuit. For a
modern integrated circuit, there could be five or more metal
interconnect layers on a chip, each interconnect layer
separated from the other by a dielectric layer. The conventional
clock network, such as H-tree network 101, overlays this
structure. The clock network is laid out on a dielectric
preferably overlying a ground plane metal. The speed of the
signals along the path of the network is governed by the
capacitance created in the dielectric between the clock
network and the ground plane metal. Further, this capacitance
is not consistent or uniform across the chip. This is so
because the topography of a given chip gives rise to local
variations, such as variations in the thickness of interlayer
dielectric material relative to the underlying layer of metal
and the presence of or absence of underlying metal layers.
Interlayer dielectric thickness is important relative to the
next level of metal. Further, the capacitive coupling from
nearby switching lines adds to or subtracts from the clock
signal development. The described variations inherent in a
chip illustrate that the capacitance is dynamic, and therefore
it is difficult to control the impedance encountered by the
clock network, and thus the signal speed. In general, there is
an inherent raw skew in the H-tree network due to this
variation in signal speed of at least 150-200 picoseconds.

One effort to eliminate the skew caused by delays in
signal speed is through the use of variable delay buffers (also
referred to as deskew buffers) at the ends of the H-tree. The
additional intentional skew introduced by these special
buffers is controlled by a carefully distributed reference
clock whose skew is made small. In this way, the main
clocks at each of the endpoints of the H-tree are synchronized
to the low skew reference clock. Although this scheme
is very effective in reducing clock skew, the deskew buffers
can cause additional jitter on the main clock due to the
presence of any power supply noise internal to the chip.
Hence, reduced skew is traded for increased jitter.
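
The deskew scheme described above can be pictured with a
simplified model in which each endpoint buffer adds a whole number
of delay steps so that its clock lines up with the latest arriving
endpoint. The function names, the raw arrival times, and the 20 ps
step size are hypothetical; actual deskew buffers compare against
the reference clock in hardware, and the jitter caused by supply
noise is not modeled here.

```python
def deskew_settings(arrival_ps, step_ps):
    """Per-endpoint buffer setting (whole delay steps) that aligns each
    endpoint with the latest-arriving one. Hypothetical model."""
    latest = max(arrival_ps)
    return [round((latest - t) / step_ps) for t in arrival_ps]


def residual_skew(arrival_ps, settings, step_ps):
    """Skew remaining after the programmed delays are added (ps)."""
    aligned = [t + n * step_ps for t, n in zip(arrival_ps, settings)]
    return max(aligned) - min(aligned)


arrivals = [405.0, 525.0, 465.0, 590.0]  # hypothetical raw arrivals (ps)
step = 20.0                              # hypothetical buffer step (ps)

settings = deskew_settings(arrivals, step)
print(max(arrivals) - min(arrivals))              # 185.0 (raw skew)
print(settings)                                   # [9, 3, 6, 0]
print(residual_skew(arrivals, settings, step))    # 5.0
```

In this model the residual skew is bounded by the buffer step
size, which is why a finer step gives tighter alignment.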

A second effort to eliminate or reduce timing skew is to
use copper (Cu) as the interconnect metal forming the clock
distribution. Since the consistency of the signal propagation
is a function of the product of the capacitance and resistance,
reducing the resistance reduces the sensitivity of the signal
propagation to variations in the capacitance. The resistance
of copper interconnect can be up to 50% lower than that of
conventional Al-0.5% Cu. However, as the clock rate keeps
climbing, even the resistance improvements provided by
copper metallization may not be sufficient to control skew.
Even with sophisticated clock network configurations like
the H-tree network, deskew circuits, and copper
metallization, integrated circuits typically have a skew budget
built into them that allows the circuits to tolerate a
certain amount of skew after which point the processing
speed must be reduced. A general rule of thumb for a skew
budget is a clock skew of 10% of the clock period.
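
The argument that lower resistance reduces sensitivity to
capacitance variation can be made concrete with a rough sketch.
It assumes the common textbook approximation that the 50% delay of
a distributed RC line is about 0.38 times the product of its total
resistance and capacitance; that factor, the branch resistance,
the capacitance, and the 10% variation are all assumptions chosen
for illustration and do not come from the text.

```python
def rc_delay_ps(r_ohm, c_pf, k=0.38):
    """Approximate delay of a distributed RC line: k * R * C.

    ohm * pF gives ps. k = 0.38 is an assumed textbook approximation.
    """
    return k * r_ohm * c_pf


def skew_from_cap_variation(r_ohm, c_nominal_pf, variation):
    """Delay spread between a high-capacitance and a low-capacitance branch."""
    slow = rc_delay_ps(r_ohm, c_nominal_pf * (1 + variation))
    fast = rc_delay_ps(r_ohm, c_nominal_pf * (1 - variation))
    return slow - fast


r_al = 200.0        # hypothetical branch resistance with Al-0.5% Cu (ohm)
r_cu = 0.5 * r_al   # "up to 50% lower" resistance with copper
c_nominal = 5.0     # hypothetical branch capacitance (pF)
variation = 0.10    # hypothetical +/-10% capacitance variation

print(skew_from_cap_variation(r_al, c_nominal, variation))  # about 76 ps
print(skew_from_cap_variation(r_cu, c_nominal, variation))  # about 38 ps
```

Under these assumptions, halving the resistance halves the skew
produced by the same capacitance variation.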

In addition to clock skew and jitter, the clock distribution
on the chip consumes valuable routing resources on integrated
circuits that could be better used for signals and
improved signal routability. As noted above, a preferred
clock network routing is on top of a chip above a ground
plane metal layer and separated by a dielectric layer. This
preferred routing requires two layers of metal.

In addition to integrated circuit die area, the global clock
distribution networks utilized today consume an increasing
amount of power. If the total capacitance of the clock
network is C, the power dissipated is CV²f, where V is the
supply voltage and f is the frequency. The global clock
distribution network on today's high-speed integrated circuit
chips typically accounts for approximately 10% of the chip
power.
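
As a worked illustration of the CV²f relation above, the following
sketch evaluates the clock network power for one set of values.
The capacitance, supply voltage, and frequency are hypothetical
and are not taken from any particular chip.

```python
def clock_power_watts(c_farads, v_volts, f_hertz):
    """Dynamic power of the clock network: C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz


# Hypothetical values: 2 nF total clock capacitance, 1.8 V, 500 MHz.
p_full = clock_power_watts(2e-9, 1.8, 500e6)
# Same network with half the capacitance.
p_half_c = clock_power_watts(1e-9, 1.8, 500e6)

print(p_full)     # about 3.24 W
print(p_half_c)   # about 1.62 W
```

Because the power is linear in C, any reduction in clock network
capacitance translates directly into the same fractional reduction
in clock power.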

The clock distribution network on a chip must be compatible
with the chip package. The conventional packaging
of a chip is illustrated in FIG. 2. FIG. 2 is an illustration of
a chip 205 packaged in a wire bond package 211. As shown
in FIG. 2, wire bonds 203, for example, gold wire bonds, are
used to connect package 211 and chip 205.

Another type of packaging is Controlled Collapse
Chip Connection (C4) packaging (sometimes referred
to as flip-chip). FIG. 3 is an illustration of a C4 package 251.
C4 is the packaging of choice for high frequency chips as it
provides high density, low inductance connections using ball
bonds 253 between chip 255 and package 261 by eliminating
the high inductance bond wires that are in wire bond
packages.

SUMMARY OF THE INVENTION

A method and apparatus for clocking an integrated circuit 
is described. An apparatus includes an integrated circuit 
having a clock driver disposed in a first side of a semicon- 
ductor substrate, and a clock distribution network coupled to 
the clock driver and disposed in a second side of the
semiconductor substrate to send a clock signal to clock an 
area of the integrated circuit. Additional features and ben- 
efits of the invention will become apparent from the detailed 
description, figures, and claims set forth below. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a schematic illustration of a prior art hierarchical 
H-tree clock distribution network that would be used to 
route clock signals above the chip circuitry. 

FIG. 2 is a schematic illustration of a prior art wire-bond
packaging technology. 

FIG. 3 is a schematic illustration of a prior art flip-chip or 
C4 packaging technology. 

FIG. 4 is a circuit diagram of a clock driver that is used
in accordance with the invention to receive a signal from a 
master clock and transmit a single-ended clock signal to a 
clock network. 

FIG. 5 is a schematic illustration of a cross-sectional side
view portion of an inverted semiconductor substrate having
a CMOS structure formed in the frontside of the substrate
and showing a masking layer overlying the backside of the
substrate and a via formed in the substrate to
diffusion/contacts for an embodiment of an integrated circuit in
accordance with the invention.

FIG. 6 is a schematic illustration of a cross-sectional side
view portion of an inverted semiconductor substrate having
a CMOS structure formed in the frontside of the substrate
and showing a dielectric material passivating the sidewalls
of the via in the backside of the substrate in accordance with
the invention.

FIG. 7 is a schematic illustration of a cross-sectional side 
view portion of an inverted semiconductor substrate having 
a CMOS circuit formed in the frontside of the substrate and 
showing a conductive material plug deposited in a passi- 
vated via formed in the backside of the substrate to diffusion, 
and a clock network line coupled to the plug in accordance 
with the invention. 

FIG. 8 is a schematic illustration of a portion of an 
inverted semiconductor structure having a trench outlining a 
hierarchical H-tree clock network in the substrate backside 
for recessing a clock network. 

FIG. 9 is a schematic illustration of a portion of an 
inverted semiconductor substrate in which a trench outlining 
a hierarchical H-tree clock network has been formed in the 
substrate, the trench being passivated and filled with metal
that forms the network in accordance with the invention. 

FIG. 10 is a schematic illustration of a cross-sectional side 
view portion of an inverted semiconductor substrate having 
a CMOS circuit formed in the frontside of the substrate and 
showing a second dielectric layer overlying the clock net- 
work in accordance with the invention. 

FIG. 11 is a schematic illustration of a cross-sectional side 
view portion of an inverted semiconductor substrate having 
a CMOS circuit on the frontside of the substrate and 
showing a heat sink coupled to the backside of the substrate 
in accordance with the invention. 

FIG. 12 is a schematic illustration of a cross-sectional side 
view of a portion of an inverted substrate taken through line 
A — A of FIG. 11 in accordance with the invention. 

FIG. 13 is a schematic illustration of a side view portion 
of a clock distribution network routed on the backside of a 
semiconductor substrate from a clock driver to a receiving 
station in accordance with the invention. 

FIG. 14 is a schematic cross-sectional top view taken 
through line B — B of FIG. 13 and displaying a window 
through the heat sink and the network passivation layer. 

FIG. 15 is a block diagram illustrating the distribution of 
a differential clock network on the backside of a semicon- 
ductor substrate. 

FIG. 16 is a schematic illustration of a cross-sectional side 
view portion of an inverted semiconductor substrate show- 
ing a coplanar waveguide transmission line or waveguide to 
implement a differential clock network routed on the back- 
side of the substrate. 

DETAILED DESCRIPTION OF THE 
INVENTION 

A method and apparatus for clocking an integrated circuit 
in a semiconductor substrate is disclosed. In the following 
detailed description, numerous specific details are set forth 
in order to provide a thorough understanding of the inven- 
tion. It will be apparent, however, to one having ordinary 
skill in the art that the specific details need not be employed
to practice the invention. In other instances, well known 
materials or methods have not been described in detail in 
order to avoid obscuring the invention. 

The invention provides a method and apparatus for clocking
an integrated circuit by routing the clock network along
the backside of the semiconductor substrate and bringing
clock signals through the backside of the semiconductor
substrate to the diffusion/contact regions of the individual
devices that make up the integrated circuit and that are
controlled by a master clock. The invention makes use of
available area on the backside of a chip and is particularly
compatible with flip-chip (C4) technology with the routing
of the clock network on the substrate backside being
completely compatible and functional with existing heat sink
technology. The removal of clock network routing from the
frontside of the integrated circuit frees up routing resources.
Further, additional levels of metal are not needed to control
clock skew, for instance, where an additional ground plane
metal overlies the structure with the clock overlying the
ground plane metal and separated by an interlayer dielectric.

By routing the clock network on the backside of the chip,
clock skew can be minimized and controlled. The semiconductor
substrate of the chip is a grounded conductor for
which capacitance, and thus signal speed, can be completely
controlled. Thus, the clock routing on the backside eliminates
the need for deskew circuits and the accompanying
jitter. By routing the clock over a dielectric having a low
dielectric constant, the capacitance of the clock distribution
may be reduced leading to reduced power or a smaller clock
driver. Further, since the backside of the chip is a grounded
plane, there is less impedance change.

FIGS. 4-12 illustrate one embodiment of a method of
routing a clock network on the backside of a semiconductor
substrate chip that is compatible with flip-chip (C4)
technology.

FIG. 4 illustrates by way of a circuit diagram the distribution
of a master clock signal through a clock network to
multiple receiving stations. In this embodiment, the clock
network is, for example, an H-tree that is routed on the
backside of a chip. In this embodiment, the sequence of steps
involved in fabricating such a structure is: (1) generate, in
the silicon substrate, vias that are filled with metal plugs and
which form the medium for electrically connecting circuitry
on the chip front side to the clock network on the chip
backside; (2) fabricate trenches in the backside on the silicon
substrate in the outline of the H-tree clock network; and (3)
fill the trenches with metal that forms the clock network.
These various steps will be described in detail below.

FIG. 4 includes a clock driver 375 that is an inverter
circuit, such as a complementary metal oxide semiconductor
("CMOS") field effect transistor. Clock driver 375 includes
an NMOS device (with gate 310, source region 320, and
drain region 325) and a PMOS device (with gate 315, source
region 335, and drain region 330). NMOS gate 310 is
connected to PMOS gate 315 and the two are connected to
a master clock.

FIG. 4 shows a clock network in which a master clock
drives clock driver 375. Adjacent clock driver 375 are
receiving inverters (labeled I, II, N) connected to clock
driver 375 (illustrated by node 327) to receive the clock
signal.

FIG. 5 shows a schematic illustration of a cross-sectional
side view portion of an integrated circuit structure. The
integrated circuit structure has been inverted so that what is
conventionally considered the frontside of the chip appears
at the bottom of the figure. The portion of the integrated
circuit illustrated in FIG. 5 shows a semiconductor substrate
300 of, for example, silicon, having embedded in substrate
300 and formed thereon a conventional complementary
metal oxide semiconductor ("CMOS") field effect transistor
inverter. The CMOS inverter consists of both an NMOS
device and a PMOS device, separated in this illustration by
shallow-trench isolation techniques (denoted by
dielectric-filled trenches labeled "STI"). It is to be appreciated that
other isolation techniques, such as for example, Local
Oxidation of Silicon (LOCOS), can be used to isolate the
devices of the circuit.

In FIG. 5, the gate 310 of the NMOS device is connected
to the gate 315 of the PMOS device. NMOS gate 310 has
adjacent source region 320 and drain region 325. Similarly,
PMOS gate 315 has adjacent source region 335 and drain
region 330. A polysilicon interconnect layer 332 lies adjacent
to the PMOS and NMOS devices. A dielectric layer
337, for example, a plasma deposited silicon oxide (SiO2),
overlies the structure. Vias are formed to the diffusion
regions and gates. A conductive plug material 340, such as
for example, tungsten (W), fills the vias. Metal tracks 342,
in the first layer of metal interconnect using, for example,
aluminum (Al/Cu), are coupled to plug material 340 to
connect the adjacent gates 310 and 315, respectively. FIG. 5
also illustrates metal tracks, e.g., Al/Cu tracks, 345 and 347
coupled to plug material 340 [illegible] of the NMOS and
PMOS devices, respectively, to form [illegible] to these
regions. Also, metal tracks 348, e.g., Al/Cu tracks, are
coupled to plug material 340 deposited in vias to drains 325
and 330 of the NMOS and PMOS devices, respectively, to
connect drains 325 and 330 to one another. A second layer
of dielectric 349, for example, a silicon dioxide (SiO2) glass,
overlies the first metallization layer containing tracks 342,
345, 347, 348, respectively. It is to be appreciated that
integrated circuits may have five or more metal interconnect
layers. In FIGS. 5-12 only one metal layer, illustrated
collectively as 342, 345, 347, and 348, respectively, is
shown to facilitate understanding of the invention. The
invention is suitable for devices having multiple levels of
interconnect on the frontside of substrate 300.

On the backside of substrate 300, via 355 is formed to
drain/contacts 325 of the NMOS device. The connection
could also be made to the PMOS device or both the NMOS
and PMOS devices. Connection to the NMOS device only is
shown for clarity. First, a masking layer 350, such as for
example, a silicon nitride (Si3N4) masking layer 350, is
deposited over the backside of substrate 300 to protect
substrate 300 from a subsequent etchant and to define a via.
Next, the backside of substrate 300 is exposed to a suitable
etchant to form via 355 in substrate 300 to drain 325. Since
the substrate thickness is likely greater than 500 microns
(μm), a fast etch method is used to etch via 355. A suitable
etch method is wet anisotropic etching of the silicon along
the 111 planes using, for example, potassium hydroxide
(KOH). Such an etch method generates a tapered hole.
Plasma etching could also be utilized, such as for example,
using an SF6 etch chemistry in a reactive ion etcher (RIE) or
an electron cyclotron resonance (ECR) etcher.

In FIG. 5, the CMOS circuit shown is a clock driver 375. A
clock signal is delivered, for example, to gate 342. The
output of clock driver 375, i.e., drain 325 and drain 330
connected to metal 348, is connected to the clock network.

Once via 355 is formed in the backside of substrate 300
to the inverter that is, for example, clock driver 375 or other
clocked circuit, FIG. 6 illustrates the further processing steps
to route the clock on the chip backside. FIG. 6 is a schematic
illustration of a cross-sectional side view portion of inverted
semiconductor substrate 300. In FIG. 6, masking layer 350
has been removed, and a dielectric interface layer 360 is
formed along the side wall of via 355 to passivate via 355.
Dielectric interface layer 360 may be deposited by conventional
techniques, such as for example, chemical vapor
deposition of dielectric material, or may be grown, for
example, as thermal SiO2. Dielectric interface layer
360 seals off any exposed silicon in via 355 and passivates
via 355. Dielectric interface layer 360 serves as an interface
between substrate 300 and the conducting material that will
ultimately fill via 355. At this point, any dielectric material
formed in the bottom of the via is removed, for example, by
an anisotropic etch, such as a reactive ion etch.

In one embodiment, a dielectric material 360 with a low
dielectric constant, on the order of 4.1 or less, is formed
along the sidewalls of via 355. The low dielectric constant
material reduces the capacitive load required of the clock
network so the power requirements of the clock driver can
be reduced, since the power (the CV²f power) dissipated is
completely due to the capacitive load that is driven.
Dielectric materials with low dielectric constants include, but are
not limited to, SiO2, SiO2 xerogel, fluorinated amorphous
carbon, fluorinated SiO2, various fluorinated polymers, and
hydrogen silsesquioxane (HSQ). A barrier layer, such as
titanium nitride, as used in conventional CMOS processing,
may be applied to the sidewalls to improve adhesion
between a conductive material and the via sidewalls.

FIG. 6 also illustrates the subsequent processing step of
depositing a conductive plug material 370, such as for
example tungsten (W), into via 355. The substrate backside
is then etched, for example, by a plasma etch, to remove
excess plug material and interface layer material from the
substrate backside.

Once the through-silicon vias are formed and filled with
metal plugs, FIG. 7 illustrates the next step to fabricate the
actual H-tree on the backside of the silicon substrate. In one
embodiment, the invention contemplates that the clock
network be recessed into the backside of substrate 300 so as
to be compatible with existing heat sink technology. FIG. 8
shows a schematic top view illustration of a portion of a
semiconductor substrate having a trench 358 outlining a
portion of hierarchical H-tree clock network formed into the
backside of semiconductor substrate 300. In this example,
via 355 represents a via to clock driver 375 which is the
CMOS inverter illustrated in FIG. 5 (and the circuit diagram
illustrated in FIG. 4) filled with conductive plug material
370. Since the via hole is tapered and not drawn to scale, it
is to be appreciated that the diameter of the hole on the
silicon backside can be much bigger than the width of the
interconnect. The recessed network connects the clock
driver to the receiving stations about the chip.

The recessed clock network illustrated in FIG. 8 may be
formed by conventional trenching techniques. For example,
the network may be formed by applying a masking layer,
such as for example, silicon nitride (Si3N4), to the backside
of substrate 300 and exposing an area on substrate 300 that
will accommodate the clock network. Alternatively, a direct
write, silicon micromachining technology such as
chlorine-enhanced laser etching or laser ablation could also be used.
A short recess, for example, a shallow trench equivalent to
the desired thickness of the fill material, is etched into
substrate 300. Next, the sidewalls of the clock network
trench are passivated and, as illustrated in FIG. 7, a
conductive metal line 376, such as for example aluminum
(Al/Cu), that is the clock network is deposited over the top
of conductive material 370 to make an electrical connection
to conductive material 370.

FIG. 7 shows clock network 376 overlying conductive
plug material 370 in via 355. Clock network 376 is separated
or isolated from substrate 300 by dielectric layer 360. Clock
network 376 is also recessed relative to the surface of
substrate 300. Recessing of clock network 376 is optional
and will allow the subsequent attachment of a heat sink to
the backside of substrate 300 with minimal disruption of
existing processes.

As noted above with respect to FIG. 8 and the accompanying
text, a shallow trench can be formed in substrate 300
to define clock network 376, such as for example an H-tree
network as illustrated in the schematic top view of FIG. 9,
to recess clock network 376 in substrate 300. As noted with
respect to FIG. 6 and the accompanying text, the trench is
shallower than the vias formed to the individual receiving
stations of the network. It is to be appreciated that the depth
of the clock network outline trench is primarily a function of
the desired thickness of the clock network conductive
material 376.



The invention contemplates that the metallization layer 
376 that forms the clock network may be thicker than 
conventional metallization layers. A thicker metallization 
layer 376 (i.e., a larger aspect ratio), such as for example, on 
the order of 5.0 microns (as compared to 1.5 microns as in 
the prior art), reduces the resistance of the metal, because 
resistance is inversely proportional to metal layer thickness 
(R=resistivity/layer thickness). Lowering the resistance 
therefore reduces the "RC delay," which is a common 
measure of chip circuit speed. 
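
The effect of the thicker metallization can be estimated with a
short calculation based on the relation R = resistivity/layer
thickness stated above, taken here as the resistance per square
of the layer. The resistivity value (approximately that of bulk
aluminum) and the line length and width are assumptions chosen
for illustration; only the 1.5 micron and 5.0 micron thicknesses
come from the text.

```python
def sheet_resistance(resistivity_ohm_m, thickness_m):
    """Resistance per square of a metal layer: resistivity / thickness."""
    return resistivity_ohm_m / thickness_m


def line_resistance(resistivity_ohm_m, thickness_m, length_m, width_m):
    """Resistance of a line: sheet resistance times number of squares."""
    squares = length_m / width_m
    return sheet_resistance(resistivity_ohm_m, thickness_m) * squares


RHO_AL = 2.7e-8    # assumed resistivity, roughly bulk aluminum (ohm*m)
LENGTH = 10e-3     # hypothetical 10 mm clock line
WIDTH = 10e-6      # hypothetical 10 micron line width

r_thin = line_resistance(RHO_AL, 1.5e-6, LENGTH, WIDTH)   # prior art
r_thick = line_resistance(RHO_AL, 5.0e-6, LENGTH, WIDTH)  # thicker layer

print(r_thin)             # about 18 ohms
print(r_thick)            # about 5.4 ohms
print(r_thin / r_thick)   # about 3.33
```

For a fixed line capacitance, the RC delay would fall by the same
factor of roughly 3.3 under these assumptions.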

FIG. 10 is a schematic illustration of a cross-sectional side 
view portion of the inverted semiconductor substrate. In 
FIG. 10, dielectric layer 380 is formed over clock network 
376. Dielectric layer 380 may be deposited by conventional 
techniques, such as for example, chemical vapor deposition 
of dielectric material. Dielectric layer 380 serves to passi- 
vate and protect the otherwise exposed portion of clock 
network 376 (such as for example an H-tree network) thus 
isolating clock network 376 from the heat sink that will 
subsequently be applied to substrate 300, grease, air, etc. 
Suitable dielectric material for dielectric layer 380 includes 
SiO2. Dielectric layer 380 may also be the same material as
interface layer 360. 

FIG. 11 is a schematic illustration of a cross-sectional side 
view portion of the inverted semiconductor substrate. In 
FIG. 11, heat sink 400 is attached to substrate 300. Because 
the clock network, represented by conductive layer 376, is 
recessed, heat sink 400 conforms to the backside of substrate 
300 as in prior structures. Thus, routing clock network 376 
on the backside of substrate 300 provides minimal disrup- 
tion to conventional attachment of the heat sink to the 
substrate backside in C4 packaging technology. 

FIG. 12 is a schematic illustration of a cross-sectional side 
view portion of inverted semiconductor substrate 300 taken 
through line A — A of FIG. 11. FIG. 12 shows conductive 
plug material 370 connected to diffusion/contact region 325. 
Conductive plug material 370 is electrically isolated from 
substrate 300 by dielectric interface layer 360. Clock
network 376 overlies conductive plug material 370 and is
likewise electrically isolated from substrate 300. Finally, 
dielectric layer 380, which can be the same material as 
interface layer 360, overlies clock network 376 to protect 
clock network 376 from heat sink 400, grease, air, etc. 

FIG. 13 is a schematic illustration of a cross-sectional side 
view portion of inverted semiconductor substrate 300. In 
FIG. 13, additional circuitry is shown disposed in substrate 
300. Inverter circuit 375 is the clock driver which is coupled 
to a master clock. Also displayed is a second inverter or 
receiving circuit 475 adjacent to a diffusion region or
receiving station 425 for a clock signal from clock driver
375. Receiving station 425 is electrically coupled to gate 430
of receiving circuit 475 by metallization layer 435, such as
for example by aluminum (Al) coupled to tungsten (W) 
filled vias to diffusion/contact region 425 and gate 430, 
respectively. Receiving station 425 receives a signal from 
clock driver 375 on clock network 376 and transmits that 
signal to transistor gate 430. Thus, FIG. 13 shows a portion 
(collectively labeled 480) of the clock network 376 wherein 
a receiving circuit 475 is coupled to clock driver 375. 

FIG. 14 is a schematic cross-sectional top view taken 
through line B — B of FIG. 13. FIG. 14 shows heat sink 400 
overlying substrate 300, with a portion 480 displaying a 
window through heat sink 400 and dielectric layer 380. 
Portion 480 shows clock network 376 coupled to clock 
driver 375 and receiving circuit 475. 

The above discussion focused on a clock network that
constituted a single-ended connection between the transmitter
and the receiver. The signal return path in this case is
complex and through the ground return path of the chip. The
clock driver and receiver could also be connected together
with a differential connection in which the signal return path
is precisely defined.

FIG. 15 schematically illustrates by way of a block
diagram a differential clock network technique that can also
be employed on the backside of a semiconductor device.
FIG. 15 shows two single connections between driver and
receiver of a differential clock routed on the backside of a
semiconductor substrate. FIG. 16 is a schematic illustration
of a cross-sectional side view portion of an inverted
semiconductor (e.g., silicon) substrate 500. Conductive metal
(e.g., aluminum) lines 560 and 565 make up the signal paths
for a differential clock routing on the backside of substrate
500. For example, conductive metal line 560 carries the
clocking signal from the clock driver while conductive metal
line 565 carries the inverse of the clocking signal. Each line
of the differential connection is configured similarly to the
way a single-ended connection is configured. This arrangement
is sometimes also referred to as a coplanar transmission
line or waveguide. The chief structural difference
between the two configurations is that for a differential
connection, two single connections are required between
driver and receiver. Conductive metal lines 560 and 565 are
isolated from one another and from substrate 500 by a
dielectric material 570. In one embodiment, dielectric
material 570 has a dielectric constant on the order of 4.1 or less.
Conductive material lines 560 and 565 are recessed in a
clock network trench such that an overlay of dielectric
material 580 is deposited to passivate the lines and protect
the lines from heat sink 600, grease, air, etc. FIG. 16 shows
a coplanar waveguide formed on the substrate backside.

The differential clock routing utilizes a coplanar 
waveguide which offers better clocking properties than a 
single-ended connection since the return path may be 
controlled more precisely. 

The advantages of routing the clock network on the 
substrate backside are many. First, clock skew due to inner 
layer thickness variations and electrical activity on lower 
level metal is eliminated. Second, since there is a near 
perfect ground plane (the silicon substrate) with no 
topography over it, the impedance control is excellent. Third, the 
deskew circuits are eliminated, since propagation delay is 
uniform. This saves real estate and complexity and 
eliminates residual jitter. Fourth, since the clock distribution is 
moved to the silicon backside, more routing space is 
available for signals, leading to a potentially smaller die size. 
Fifth, since there are no other non-clock signals nearby or 
any other metal routing layers above the clock network, the 
metal thickness of the network can be increased to a larger 
aspect ratio to reduce resistance and increase circuit speed. 
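
The fifth advantage can be illustrated with the standard relation for a rectangular conductor, R = rho * L / (w * t). The following is a minimal sketch only; the line length, width and the two thicknesses are hypothetical values chosen for illustration, and the resistivity is an approximate bulk figure for aluminum rather than a value taken from this description.

```python
def line_resistance(resistivity_ohm_m, length_m, width_m, thickness_m):
    """Resistance of a uniform rectangular conductor: R = rho * L / (w * t)."""
    return resistivity_ohm_m * length_m / (width_m * thickness_m)


# Approximate bulk resistivity of aluminum at room temperature (assumption).
RHO_AL = 2.65e-8  # ohm-metres

# Hypothetical clock line: 10 mm long, 2 um wide, at two metal thicknesses.
r_thin = line_resistance(RHO_AL, 10e-3, 2e-6, 0.75e-6)
r_thick = line_resistance(RHO_AL, 10e-3, 2e-6, 1.5e-6)

# With width and length fixed, doubling the thickness halves the resistance.
print(f"thin: {r_thin:.1f} ohm, thick: {r_thick:.1f} ohm")
```

Under these assumptions the thicker line has half the resistance of the thinner one, which is the effect the larger aspect ratio is intended to exploit.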

It is to be appreciated that in addition to minimizing clock 
skew, backside signal routing as previously described can be 
used to route other types of signals in a similar manner, 
particularly critical timing signals, inside an integrated 
circuit chip. 

In the preceding detailed description the invention is 
described with reference to specific embodiments thereof. It 
will, however, be evident that various modifications and 
changes may be made thereto without departing from the 
broader spirit and scope of the invention as set forth in the 
claims. The specification and drawings are, accordingly, to 
be regarded in an illustrative rather than a restrictive sense. 

What is claimed is: 

1. A method comprising: 

forming a via in a second side of a semiconductor 
substrate to a driver circuit on a first side of the 
semiconductor substrate; 

routing a signal network on the second side of the 
substrate to a plurality of nodes on the first side of the 
substrate; 

depositing a conductive material in the via; and 

coupling the conductive material to the signal network. 

2. The method of claim 1, further comprising, prior to 
depositing the conductive material, introducing a layer of 
dielectric material along a sidewall of the via. 

3. The method of claim 2, wherein the plurality of nodes 
are coupled to a plurality of transistors that form an area of 
an integrated circuit, 

wherein forming a via in the second surface comprises 
forming a via to each of the plurality of transistors, and 

wherein depositing the conductive material in the via 
comprises depositing conductive material in each of the 
via to each of the plurality of transistors. 

4. The method of claim 3, further comprising: 
forming a trench in the second side, the trench having a 
bottom and sidewalls; and 

routing the signal network in the trench, wherein 
introducing a layer of dielectric material includes, prior to 
routing the signal network in the trench, introducing a 
dielectric material along the bottom and about the 
sidewalls of said trench to passivate said trench. 

5. The method of claim 4, wherein the dielectric material 
is a first dielectric material, the method further comprising: 

depositing a second layer of dielectric material over the 
signal network. 

6. The method of claim 5, further comprising planarizing 
the second layer of dielectric material. 

7. The method of claim 2, wherein the dielectric material 
has a dielectric constant of 4.1 or less. 

8. The method of claim 1, wherein the signal network is 
comprised of a conductive material and is deposited such 
that the thickness of the conductive material is greater than 
1.5 microns. 

9. The method of claim 1, wherein the signal network is 
a single-ended clock network. 

10. The method of claim 1, wherein the driver circuit 
comprises a first clock driver to send a clock signal and a 
second clock driver to send the inverse of the clock signal, 
and wherein the signal network is a first clock distribution 
network, the method comprising: 

forming a via to the second clock driver, 

routing a second clock distribution network on the second 
side of the substrate; 

depositing a conductive material to the second via; and 

coupling the conductive material to the second clock 
distribution network on the second side of the substrate. 

11. The method of claim 1, wherein routing the signal 
network comprises recessing the signal network in the 
second side of the substrate. 

12. A method comprising: 

routing a signal network on a first side of a semiconductor 
substrate; 

coupling the signal network to a first device and to a 
second device on a second side of the semiconductor 
substrate opposite the first side; and 

passing a signal between the first device and the second 
device via the signal network. 

13. The method of claim 12, wherein routing the signal 
network comprises recessing the signal network in the 
second side of the substrate. 

SEMICONDUCTOR INTEGRATED CIRCUIT 
DEVICE 

BACKGROUND OF THE INVENTION 

The present invention relates to a semiconductor inte- 
grated circuit device and, more particularly, to a semicon- 
ductor integrated circuit device having clock wiring with 
reduced clock skew. 

Some semiconductor integrated circuit devices, such as 
VLSIs, include a synchronous circuit having flip-flops 
driven by a common clock signal. To make such a synchro- 
nous circuit operate more rapidly, these semiconductor inte- 
grated circuit devices require that clock skew (i.e., differ- 
ences in clock supply timing between flip-flops) be 
minimized for removal of signal-to-signal timing differ- 
ences. 

Various layout design techniques for reducing such clock 
skew have been proposed. One such technique involves 
installing tree-structure paths between a clock signal gen- 
erator and a plurality of flip-flops, wherein the length of the 
path between the generator and each flip-flop is suitably 
adjusted. Another technique, which is disclosed in Japanese 
Published Unexamined Patent Application No. Hei 
9-307069, requires inserting clock buffers where appropriate 
when tree-structure wiring has been established, whereby 
the tree structure is readjusted so that the difference between 
a maximum and a minimum of delays on the readjusted 
wiring attains a predetermined value. Where there still 
remains clock skew despite the provision of tree structure 
wiring, another technique disclosed in Japanese Published 
Unexamined Patent Application No. Hei 8-274260 seeks to 
minimize the skew by replacing appropriate drivers with 
small-capacity drivers so that the paths with maximum skew 
become equal in skew level to other tree branch paths 
between second stage clock drivers and block circuits. 

The conventional techniques outlined above have failed to 
consider optimum arrangements of skew reduction for 
VLSIs. These techniques presuppose that on tree-structure 
paths between a clock generator and each flip-flop, each 
node is afforded wiring of an equal length. If equal-length 
wiring is provided ranging from a clock generator through a 
plurality of stages of drivers to flip-flops, alternative lines 
necessitated by the equal-length lines at all stages prolong 
the overall clock wiring. The resulting disadvantages 
include more delays of clock signals and higher power 
dissipation. 

Furthermore, the conventional techniques above have 
disregarded an optimum clock layout for each of the func- 
tional portions or for each of a plurality of clock phases in 
connection with LSIs. A VLSI comprises random logic 
circuits and data paths reflecting various functions of the 
device, as well as numerous I/O pads. The conventional 
techniques have so far shied away from providing any 
optimum clock layout for the diverse internal arrangements 
of the LSI. 

SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to provide 
a semiconductor integrated circuit device having a clock 
skew-lowering layout that ensures reduced wiring delays, 
enhanced packaging density and low clock power dissipa- 
tion. 

It is another object of the present invention to provide a 
semiconductor integrated circuit device having an optimum 
clock layout corresponding to each of the functional portions 
constituting an LSI. 
These and other objects, features and advantages of the 
invention will become more apparent upon a reading of the 
following description and appended drawings. 

Major features and benefits of the invention are outlined 
below. In carrying out the invention, and according to one 
aspect thereof, there is provided a semiconductor integrated 
circuit device comprising a plurality of stages of clock 
drivers furnished on clock wiring paths ranging from a clock 
generator to flip-flops. Clock lines connecting upper stage 
clock drivers have an equal length each in the form of a tree 
structure, and clock lines connecting lower stage clock 
drivers have the shortest possible lengths. 

The lower the stage, the greater the number of clock 
drivers furnished. In that structure, clock lines connecting 
lower stage clock drivers are made to have not equal lengths 
but the shortest possible lengths. The arrangement shortens 
the overall clock wiring, reduces wiring delays, enhances 
packaging density, and lowers clock power dissipation. 
Since the lower stage clock drivers are connected by lines 
that are shorter than those connecting the upper stage clock 
drivers, the lower stage clock drivers may have the shortest 
possible wiring entailing negligible clock skew. Because the 
upper stage clock drivers are connected by extended wiring, 
the lines constituting such wiring are made to be equal in 
length in order to minimize clock skew. 
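
The wiring saving described above can be sketched numerically. The example below is illustrative only: it assumes simple point-to-point lines measured as Manhattan distances, models "equal length" by padding every line to the longest one, and uses hypothetical coordinates and function names that do not come from this description.

```python
def manhattan(a, b):
    """Manhattan (rectilinear) distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def equal_length_total(driver, sinks):
    """Total wire if every line is padded to the longest driver-to-sink run."""
    longest = max(manhattan(driver, s) for s in sinks)
    return longest * len(sinks)


def shortest_total(driver, sinks):
    """Total wire if every line takes its own shortest rectilinear run."""
    return sum(manhattan(driver, s) for s in sinks)


# Hypothetical lower stage driver and four fan-out destinations.
driver = (0, 0)
sinks = [(1, 0), (0, 2), (3, 1), (2, 2)]

print(equal_length_total(driver, sinks))  # 16
print(shortest_total(driver, sinks))      # 11
```

In this toy case the shortest-possible wiring uses 11 units against 16 for equal-length wiring. The gap grows with the number of drivers, which is consistent with applying shortest wiring at the lower, more populous stages.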

A semiconductor integrated circuit device according to 
another aspect of the invention also comprises a plurality of 
stages of clock drivers. Of these drivers, intermediate stage 
clock drivers are provided with clock logic circuits for 
controlling clock signal supply. 

The clock logic circuits control the supply of clock signals 
to individual function blocks corresponding to the 
intermediate clock drivers in question. The setup implements a 
clock signal supply scheme suitable for a VLSI while 
minimizing clock skew. Preferably, next-to-last stage clock 
drivers may have clock logic circuits for supply of clock 
signals to the flip-flops of random logic circuits and 
input/output pads, and both last stage and next-to-last stage clock 
drivers may have clock logic circuits for the supply of clock 
signals to the flip-flops of data paths. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a schematic circuit diagram of clock logic 
circuits applicable to a semiconductor integrated circuit 
device embodying the invention; 

FIG. 2 is a top view of a clock layout on a chip carrying 
the semiconductor integrated circuit device embodying the 
invention; 

FIG. 3 is a detailed plan view of the layout of a region 204 
in FIG. 2; 

FIG. 4 is a more detailed plan view of the vicinity of a 
region 301 in FIG. 3; 

FIG. 5 is a detailed plan view of the layout of a region 206 
in FIG. 2; 

FIG. 6 is a more detailed plan view of the layout of a 
region 504 in FIG. 5; 

FIG. 7 is a set of schematic views depicting relations 
between clock drivers at different stages on the one hand and 
logic blocks on the other hand in the inventive 
semiconductor integrated circuit device; 

FIG. 8 is a set of layout diagrams illustrating relations 
between the regions handled by the second stage clock 
drivers shown in FIG. 7 on the one hand and logic blocks on 
the other hand; 

FIG. 9 is a detailed plan view of clock drivers laid out in 
data paths; 
FIG. 10 is a detailed plan view of clock drivers laid out 
in an input/output pad; 

FIG. 11 is a conceptual diagram showing how different 
stages of the inventive semiconductor integrated circuit 
device are typically wired; 

FIG. 12 is a set of explanatory diagrams indicating how 
differences between clock delays are reduced over different 
paths by use of clock wiring 1102; 

FIG. 13 is a schematic view depicting typical clock wiring 
ranging from second stage clock drivers to third stage clock 
drivers; and 

FIG. 14 is a partially enlarged view of the clock wiring 
from the second stage clock drivers to the third stage clock 
drivers in FIG. 13. 

DETAILED DESCRIPTION OF THE 
EMBODIMENTS 

FIG. 1 is a schematic diagram showing clock logic 
circuits applicable to a semiconductor integrated circuit 
device embodying the invention. This embodiment 
comprises four stages of clock drivers through which a clock 
signal generator 101 supplies clock signals to all flip-flops 
106 inside the chip. The flip-flops are located in random 
logic circuits, data paths, and input/output pads. 

The clock drivers at each stage play the roles described 
below. Clock drivers 102, situated at the first stage as viewed 
from the clock signal generator, are called root clock drivers. 
These drivers distribute throughout the entire chip the clock 
signals output by the clock signal generator. 

Second stage clock drivers 103 distribute clock signals to 
third stage clock drivers 104. The drivers 104 are located in 
regions each made up of a number of logic blocks in the 
chip. 

The third stage clock drivers 104 and fourth stage clock 
drivers 105 serve to distribute clock signals to all flip-flops 
in the logic blocks. If the third stage clock drivers 104 are 
constituted logically to control the supply of clock signals, 
it is possible to control clock signal supplies on a 
block-by-block basis. 

Each third stage clock driver 104 supplies clock signals to 
a group of fourth stage clock drivers 105 distributed in each 
of the logic blocks. The fourth stage clock drivers 105 
supply clock signals directly to the flip-flops 106. Each 
driver 105 feeds clock signals to a group of flip-flops 106 
distributed in the logic blocks of random logic circuits. Each 
data path 209 supplies clock signals to a column of flip-flops 
113 via a clock terminal 114. Each I/O pad portion 202 feeds 
clock signals to flip-flops 119 within a predetermined 
distance by means of clock terminals 118. 

The third stage clock drivers 104 are provided as AND 
circuits each having a control signal input terminal 107. All 
third stage clock drivers 104 inside each of the logic blocks 
are connected to a signal line 108 that controls the supply of 
clock signals within the block in question. With the third 
stage clock drivers 104 provided as AND circuits, there is no 
need to provide each fourth stage clock driver 105 as an 
AND circuit. This minimizes the number of clock drivers 
that need to be replaced by AND circuits. A reduction in the 
number of clock drivers replaced by AND circuits directly 
translates into reduction in the lengths of the control signal 
lines. 

The concept sketched in FIG. 1 is not limited to a 
single-phase clock scheme; it is obviously applicable to 
multi-phase clock arrangements as well. The semiconductor 
integrated circuit device of this embodiment includes three 
lines 302 coming from the clock signal generator and 
implementing a three-phase clock scheme, as shown in FIG. 
3. The three-phase clock scheme generates three kinds of 
clock signal: a first, fastest clock signal fed 
to a CPU and an FPU in the semiconductor integrated circuit 
device; a second clock signal fed to bus access 
controllers such as a DMAC (direct memory access controller) 
and I/O pads; and a third clock signal fed to peripheral 
controllers. In FIG. 3, three root clock driver layout regions 
301 are furnished to match the three clock signals of the 
three-phase clock scheme. For purpose of explanation, FIG. 
1 indicates in unified fashion the three kinds of clock signal: 
one clock signal fed to the flip-flops (106, 109) of the 
random logic circuits; another clock signal supplied to the 
flip-flops of the data paths; and another clock signal 
fed to the flip-flops (119) of the I/O pads. Of the three-phase 
clock signals, the first clock signal is sent to the random 
logic circuits and data paths, the second clock signal is given 
to the random logic circuits and I/O pads, and the third clock 
signal is delivered to the random logic circuits. 

Some flip-flops admit control signals, while others do not. 
[illegible] flip-flops. The data paths 209 have no flip-flops 
admitting control signals. In fact, the flip-flops 113 arranged 
in each single column are controlled collectively by a clock 
driver 112 that serves as an AND circuit having a control 
signal input terminal 115. Where AND circuits are used in the 
last stage clock drivers 112 on the data paths, the flip-flops of 
the data paths have no need for control terminals. This 
structure enhances the packaging density of the embodiment. 

[illegible] by unifying 
differences in arrival time between clock signals sent from 
the clock signal generator 101 to all flip-flops (called clock 
delays hereunder). To unify the clock delays requires 
adjusting both the driving force of clock drivers and the load 
capacities associated with the clock drivers. The load 
capacity of a clock driver is determined by the total sum of the 
capacity of a line connected to the driver in question and the 
capacity of the input terminal of a fan-out destination cell. 

In the logic setup of this embodiment, the driving forces of 
the clock drivers at each stage and the load capacities 
associated therewith are adjusted so as to unify the clock 
delays involved, thereby harmonizing all clock delays. The 
clock drivers 102 and 103 use cells of the same type 
throughout all paths, each driver having an identical fan-out 
count and an equal wiring length. The clock drivers 104 and 
105 have different fan-out counts at each stage but share the 
same total capacity including wiring capacity, with the 
exception of the clock drivers 112 on the data paths. For 
example, if fan-out destination clock drivers are far away so 
that the wiring involved is necessarily long, the fan-out 
count tends to be small. Conversely, if clock drivers are 
nearby, the fan-out count is likely to be large. Each clock 
driver 112 on a data path has up to 32 flip-flops 113 within 
the path. Thus the clock drivers 112 have greater capacities 
than the clock drivers of the random logic circuits. For that 
reason, clock drivers with high driving forces are used at the 
last stage to harmonize the clock delays with those of the 
random logic circuits and I/O pads. Thanks to the 
above-described four-stage clock logic arrangement over all paths, 
clock delays may be adjusted at each stage. 

Where multi-phase clock signals are used, similar logical 
structures are instituted. Because all phases are matched 
with like logical structures, there occurs little clock skew 
between the multi-phase clock signals. 

FIG. 2 shows a clock layout on a chip 201 carrying the 
semiconductor integrated circuit device embodying the 
invention. In FIG. 2, the clock signal generator 101 is 
located in a corner of the chip 201 and adjacent to an I/O pad 
portion 202. 

All clock drivers are furnished in a cell layout region 203. 
The root clock drivers 102 are gathered together in a region 
204 near the chip center. The clock signal generator 101, 
which is vulnerable to adverse effects from other circuits, is 
located peripherally in the chip. The root clock driver 102 
located centrally in the chip extends clock wiring to the 
downstream root clock driver. This setup ensures stable 
supply of clock signals and makes it easier to minimize 
clock skew. 

Some second stage clock drivers 103 are allocated to a 
region 206 that comprises a number of logic blocks. Second 
stage clock drivers 103 assigned to the data path 209 are 
located in a clock driver layout region 207 on the clock 
terminal side of the data paths. Likewise, second stage clock 
drivers 103 destined for the I/O pad portions 202 are 
furnished in a clock driver layout region 208 on the clock 
terminal side of each pad. 

FIG. 3 depicts in detail the layout of the region 204 in 
FIG. 2. As mentioned earlier, what FIG. 3 portrays is a 
three-phase clock layout. The root clock drivers 102 are 
gathered together in regions 301 that are each adjacent to a 
power supply line 303. Reference numeral 302 denotes 
signal lines coming from the clock signal generator. 

In this multi-phase setup, the root clock drivers 102 for 
each clock phase flank a vertical power supply line 303 and 
a horizontal power supply line 304. The clock lines 302 
leading to the clock driver layout regions 301 for all clock 
phases run in parallel up to a point 305 where the lines are 
branched, the point 305 being at an equal distance from all 
clock driver layout regions. Because the clock driver layout 
regions are not concentrated on a single power supply line, 
the supply of power is stabilized. The wiring arrangement 
above makes the line lengths substantially equal for all 
phases between the clock signal generator 101 and each of 
the root clock drivers 102. 

FIG. 4 provides a more detailed view of the vicinity of one 
region 301 in FIG. 3. As illustrated, the root clock drivers 
102 in the region 301 are arranged adjacent to one another 
in the vertical direction. There is no other cell interposed 
between each root clock driver 102 and the power supply 
line 303. With the root clock drivers 102 gathered near the 
chip center, the lines ranging from the clock signal generator 
to all root clock drivers 102 are made equal in length. 
Because the maximum distance between the root clock 
drivers 102 and the second stage clock drivers 103 is 
reduced, clock delays are lowered correspondingly. Where 
the regions 301 are located adjacent to the power supply 
lines, it is possible to supply power in a stable manner to the 
regions where a plurality of root clock drivers 102 are 
gathered together. 

FIG. 5 gives a detailed view of the layout of the region 
206 in FIG. 2. The second stage clock driver 103 in the 
region 206 is located near the center of gravity of a plurality 
of third stage clock drivers 104 distributed within the same 
region. Lines making up a network 501 ranging from the 
second stage clock driver 103 to the third stage clock drivers 
104 are equalized in length. 
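
As a rough illustration of placing a driver near the center of gravity of its fan-out destinations, the sketch below averages the coordinates of the downstream drivers. The coordinates and the function name are hypothetical, and an unweighted mean is an assumption; the description does not state how the center of gravity is computed.

```python
def center_of_gravity(points):
    """Unweighted mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Hypothetical positions of four third stage clock drivers in one region.
third_stage = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]

second_stage_site = center_of_gravity(third_stage)
print(second_stage_site)  # (2.0, 1.0)
```

A driver placed at this point is, in this symmetric example, equidistant from all four destinations, which eases the equalization of line lengths in the network.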

Each third stage clock driver 104 is allocated to a region 
502 wherein fourth stage clock drivers 105 are gathered 
adjacent to one another. Lines constituting a network 503 
ranging from the third stage clock driver 104 to the fourth 
stage clock drivers 105 are equalized in length. 

FIG. 6 gives a more detailed view of the layout of a region 
504 in FIG. 5. As illustrated, each fourth stage clock driver 
105 is allocated to the region 504 wherein flip-flops 106 are 
gathered adjacent to one another. Lines making up a network 
601 ranging from the fourth stage clock driver 105 to the 
flip-flops 106 are equalized in length. 

FIG. 7 provides a set of schematic views depicting 
relations between clock drivers at different stages on the one 
hand and logic blocks on the other hand in the inventive 
semiconductor integrated circuit device. Each of the regions 
206 comprises either a plurality of logic blocks or part of a 
logic block. Logic blocks wherein the number of third stage 
clock drivers 104 is smaller than a reference fan-out count 
of the second stage clock driver 103 are gathered together; 
logic blocks wherein the number of third stage clock drivers 
104 is larger than the reference fan-out count are each 
divided into smaller regions. 

Illustratively, logic blocks 702, 703 and 704 wherein the 
number of third stage clock drivers 104 is smaller than the 
reference fan-out count are gathered together in a region 705 
handled by a second stage clock driver 701. On the other 
hand, a logic block 708 in which the number of third stage 
clock drivers 104 is larger than the reference fan-out count 
is divided into regions 709 and 710. The region 709 is 
handled by a second stage clock driver 706, and the region 
710 is dealt with by a second stage clock driver 707. 
Reference numeral 711 in this setup denotes a root clock 
driver. 

However, it is not desirable to establish a logical structure 
such as one of a region 718 that is divided into regions 714 
and 717, the region 714 being handled by a second stage 
clock driver 713 connected to a root clock driver 712, the 
region 717 being dealt with by a second stage clock driver 
716 coupled to another root clock driver 715. This type of 
logical structure will give rise to a possibility that a single 
logic block can be subject to adverse effects of the clock 
skew over relatively long wiring between the clock signal 
generator and the root clock drivers. 

Where the number of clock drivers is smaller than the 
reference fan-out count inside a region 720 handled by a 
second stage clock driver 719, dummy cells 721 are added 
to the region to compensate for the shortage of clock drivers. 
A dummy cell is a cell of which the input capacity is the 
same as that of a clock driver connected to the same network 
and which does not use output signals of the network. 

As described, the fan-out count of the second stage clock 
driver 103 may be taken as the reference value with respect to 
which adjustments are made as needed. This makes it 
possible to harmonize on all paths the clock delays 
stemming from the second stage clock drivers 103. 
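
The gathering, dividing and dummy-cell padding described above can be sketched as a simple planning routine. This is a minimal illustration under stated assumptions, not the layout procedure of the embodiment: the greedy merge of neighbouring blocks, the block names and the function name `plan_regions` are all hypothetical, and the constraint that a divided block should stay under one root clock driver is not modelled.

```python
def plan_regions(blocks, ref_fanout):
    """Plan regions so each second stage driver sees exactly ref_fanout loads.

    blocks: list of (name, third_stage_driver_count), listed so that
    neighbouring entries are assumed to be adjacent on the chip.
    Returns a list of dicts with the blocks covered, the real driver count
    and the number of dummy cells needed to reach ref_fanout.
    """
    regions = []
    pending, pending_count = [], 0

    def flush():
        # Close the region currently being gathered, padding with dummies.
        nonlocal pending, pending_count
        if pending:
            regions.append({"blocks": pending, "drivers": pending_count,
                            "dummies": ref_fanout - pending_count})
            pending, pending_count = [], 0

    for name, count in blocks:
        if count > ref_fanout:
            # Large block: divide it into regions of at most ref_fanout.
            flush()
            remaining = count
            while remaining > 0:
                take = min(ref_fanout, remaining)
                regions.append({"blocks": [name], "drivers": take,
                                "dummies": ref_fanout - take})
                remaining -= take
        else:
            # Small block: gather with neighbours while the total still fits.
            if pending_count + count > ref_fanout:
                flush()
            pending.append(name)
            pending_count += count
    flush()
    return regions


# Hypothetical example with a reference fan-out count of 4.
plan = plan_regions([("A", 1), ("B", 2), ("C", 1), ("D", 6), ("E", 3)], 4)
for region in plan:
    print(region)
```

In this example blocks A, B and C are gathered into one full region, block D is divided into two regions (the second padded with two dummy cells), and block E forms a region padded with one dummy cell, so every second stage clock driver drives four loads.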

FIG. 8 provides a set of layout diagrams illustrating 
relations between the regions handled by the second stage 
clock drivers shown in FIG. 7 on the one hand and logic 
blocks on the other hand. Illustratively, if the reference 
fan-out count of a second stage clock driver 809 is 4, then 
a logic block 803, in which the number of third stage clock 
drivers 802 is greater than the reference fan-out count, is 
divided into regions 810 and 811. Third stage clock drivers 
in each of the regions 810 and 811 are assigned a second 
stage clock driver 809. How to divide a logic block is 
determined by the arrangement of third stage clock drivers 
802 furnished therein. If the clock driver count in a divided 
region is smaller than the reference fan-out count, then 
previously furnished dummy cells 814 are used to take the 
place of third stage clock drivers 802 to compensate for the 
shortage of clock drivers. Meanwhile, each of logic blocks 
804, 805, 806, 807 and 808 has a smaller number of third 
stage clock drivers 802 than the reference fan-out count. In 
such cases, adjacent logic blocks are gathered together to 
form a single region to which a second stage clock driver 
809 is allocated. 

In FIG. 8, the logic blocks 804, 805 and 806 are combined 
into a region 812, and the logic blocks 807 and 808 into a 
region 813. Where the number of third stage clock drivers 
802 is smaller than the reference fan-out count inside a 
combined region, previously furnished dummy cells 814 are 
utilized to compensate for the shortage with respect to the 
fan-out count of the second stage clock driver 809. 

FIG. 9 is a detailed view of clock drivers laid out in data 
paths 209 shown in FIG. 2. A clock terminal 902 is allocated 
to each column of flip-flops 901 in the data paths 209. The 
clock terminals 902 are arranged so as to line up on one side 
of the data paths 209. 

A clock driver layout region 207 is provided on a cell 
layout region 907 on the side of the clock terminals 902 for 
the data paths. Inside the clock driver layout region 207 are 
third stage clock drivers 905 and fourth stage clock drivers 
906. 

The clock driver layout region 207 is also arranged to be 
adjacent to a power supply line 904. If clock drivers of the 
data paths are located on the cell layout region, it is possible 
to gather clock drivers together where the clock terminals 
are concentrated. This helps prevent a surge in clock delays. 
Providing the clock driver layout region forestalls increases 
of distances up to the clock drivers. Although the clock 
drivers of the data paths are considerably concentrated in 
terms of layout because of their numerous clock terminals, 
locating the clock driver layout region adjacent to the power 
supply line ensures stable supply of power. 

Although not shown, there exist a large number of third 
stage clock drivers 905 of the data paths. In this setup, the 
wiring between the second stage clock drivers 103 and the 
third stage clock drivers 905 of the data paths is furnished as 
follows: a plurality of third stage clock drivers are grouped 
together, and the wiring within that group is made as short 
as possible. Lines between the second stage clock drivers 
103 and the respective groups of third stage clock drivers are 
equalized in length. 

FIG. 10 provides a detailed view of clock drivers laid out 
in the I/O pad portion 202 shown in FIG. 2. A clock terminal 
1002 is allocated to each flip-flop 1001 inside the I/O pad 
portion 202. The clock terminals 1002 are arranged so as to 
line up on one side of the I/O pad portion 202. A clock driver 
layout region 208 is furnished on a cell layout region 1006 
on the side of the clock terminals 1002 in the I/O pad portion 
202. Inside the clock driver layout region 208 are third and 
fourth stage clock drivers 1004 and 1003 arranged in a row, 
each third stage clock driver being flanked by a plurality of 
fourth stage clock drivers. A reference wiring length is set 
for the fourth stage clock drivers 1003, and as many clock 
terminals 1002 as a reference fan-out count are allocated 
within the reference wiring length. This arrangement is 
adopted here because the number of clock terminals is 
small despite the long distance occupied by them in the I/O 
pad portion 202. 

If there are fewer clock terminals within the reference 
wiring length 1007 than the reference fan-out count, then 
dummy cells 1005 are added to compensate for the shortage. 

When the layout regions are furnished as described, any 
increases in the distances up to the clock drivers are sub- 
stantially prevented. The use of numerous dummy cells 
makes it possible to harmonize clock delays despite the 
presence of sparsely arranged clock terminals. 

The dummy cells 1005 should preferably be arranged in
the same row as that of a plurality of clock drivers as
illustrated in FIG. 10. That is because the arrangement
facilitates adjustment of the wiring lengths while minimizing
increases in occupied areas.

In FIG. 10, one fourth stage clock driver 1003 is furnished
corresponding to four clock terminals. However, the one-to-four
correspondence is not limitative of the invention. For
example, suppose that each fourth stage clock driver 1003 is
assigned 12 terminals and that only one flip-flop 1001 is
connected to a fourth stage clock driver 1003. In that case,
11 dummy cells 1005 may be connected to the fourth stage
clock driver 1003 in question.
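
The dummy-cell count in the example above is simple arithmetic. The following minimal Python sketch makes it explicit; the function name and its arguments are illustrative assumptions and do not come from the specification.

```python
def dummy_cells_needed(reference_fanout: int, connected_terminals: int) -> int:
    """Return how many dummy cells pad a driver up to its reference fan-out.

    Both arguments are hypothetical names for the "reference fan-out count"
    and the number of real clock terminals attached to one driver.
    """
    if connected_terminals > reference_fanout:
        raise ValueError("driver already exceeds its reference fan-out")
    return reference_fanout - connected_terminals


# One flip-flop on a driver assigned 12 terminals leaves 11 slots for dummy cells.
print(dummy_cells_needed(12, 1))
```

A driver that is already fully loaded needs no dummy cells, so the function returns zero in that case.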

In the I/O pad portion 202, each flip-flop 1001 is associated
with an input/output circuit 1008 and an I/O pad 1009
which are arranged in the direction of a chip edge. In the
inventive semiconductor integrated circuit device, the logic
circuits inside of the I/O pad portions 202 use signals with
an amplitude of 1.8 V, and are interfaced to signals with an
amplitude of 3.3 V from outside the chip. The interface
capability is implemented by use of a level shifter circuit
arrangement. More specifically, each I/O circuit 1008
includes a three-state logic circuit, a level shifter circuit and
an I/O buffer circuit arranged in that order starting from the
flip-flop side. These circuits are connected to an I/O pad
1009.

FIG. 11 gives a conceptual view illustrating how different
stages of the inventive semiconductor integrated circuit
device are typically wired. Lines 1102 and 1103 are equalized
in length and constitute a binary tree structure. Lines
1104 and 1105 are made as short as possible. That is, the
lines at a higher stage where fan-out destination cells are
distributed extensively are equalized in length; wiring at a
lower stage where fan-out destination cells are narrowly
distributed is made the shortest possible wiring. Length
differences (i.e., between a maximum and a minimum
length) between clock lines equalized in length are smaller
than length differences between clock lines that are made as
short as possible.

Wiring 1101 is provided at the highest stage. However,
since this wiring involves root clock drivers 102 gathered
together as shown in FIG. 4, it is prepared as the shortest
possible wiring.

The lines 1102 and 1103, with their fan-out destination
cells distributed extensively, are equalized in length on all
paths. This arrangement helps harmonize clock delays over
the paths.

The above adjustments are made possible because the
number of clock drivers at the upper stages is limited. The
lower the stage, the greater the number of clock drivers
installed. Thus the lines are made as short as possible at
lower stages in order to reduce the overall wiring length,
boost packaging density and minimize line-induced delays.
Because the wiring is shorter at lower stages, clock skew
stemming from the line-induced delays is negligible there.

At higher stages where extended wiring promotes vulnerability
to delays, the lines involved are equalized in length
so as to reduce the clock skew caused by the line-induced
delays. When all lines are equalized in length on all paths,
differences in load capacity between clock drivers are
eliminated.

FIG. 12 offers a set of explanatory views indicating how
differences between clock delays are reduced over different
paths by use of the clock wiring 1102. It may happen that
differences in clock delay 1202 exist between second stage
clock drivers 103 and flip-flops 106. Such differences, if they
occur, are reduced by modifying the configuration of the
lines 1102 which are basically equalized in length and which
constitute a binary tree structure. Specifically, clock delays
1201 are adjusted between the root clock drivers 102 and the
second stage clock drivers 103. For example, if there are
clock delay differences between each of two second stage
clock drivers 103 connected to a line 1102 on the one hand
and the corresponding flip-flops 106 on the other hand, the
lengths of lines 1205 and 1206 between a junction 1207 and
the second stage clock drivers 103 are adjusted at the point
1207 in such a manner that the clock delay differences are
removed. Any clock delay differences that may occur
between another root clock driver 102 and the second stage
clock drivers 103 are eliminated by adjusting the length of
a line 1204 between the clock driver 102 and the junction
1207. Such adjustments, which are relatively simple in
procedure and required at only a small number of locations,
may be carried out manually.
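
The branch adjustment at junction 1207 can be illustrated with a small Python sketch. It assumes, purely for illustration, that wire delay grows linearly with length at a fixed rate; the specification does not state a delay model, and the function name, parameters and units below are hypothetical. The sketch lengthens the branch feeding the faster downstream path until both paths have the same total delay.

```python
def balance_branches(base_len, delay_a, delay_b, delay_per_unit):
    """Return (len_a, len_b) so that both paths have equal total delay.

    base_len       -- nominal (equalized) length of each branch
    delay_a/b      -- delay measured downstream of each second stage driver
    delay_per_unit -- assumed linear wire delay per unit of length
    """
    extra = abs(delay_a - delay_b) / delay_per_unit
    if delay_a < delay_b:
        return base_len + extra, base_len   # slow down the faster path A
    return base_len, base_len + extra       # slow down the faster path B


# Hypothetical numbers: delays in ps, lengths in um, 2 ps of wire delay per um.
len_a, len_b = balance_branches(100.0, 300.0, 500.0, 2.0)
total_a = len_a * 2.0 + 300.0
total_b = len_b * 2.0 + 500.0
print(len_a, len_b, total_a, total_b)
```

Moving the junction instead of adding wire would follow the same arithmetic, with the extra length taken from one branch and given to the other.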

Where clock drivers of high driving forces are used, lines 
wider than usual need to be employed to counter migration. 
Because the incidence of migration is proportional to the 
strength of current, the wiring need only be composed of
wide lines up to first junctions beyond which the current
strength is reduced by half. Along clock wiring 1301
between a second stage clock driver 103 and third stage 
clock drivers 104 in FIG. 13, a portion 1302 is made of a 
wide line (having twice the width of ordinary wiring) as 
shown in FIG. 14. The rest of the wiring has the ordinary 
width such as that of a portion 1401. An output terminal 
1402 of each second stage clock driver 103 is shaped as a 
rectangle at least as broad as the wide line so that the latter 
may be connected properly to the terminal 1402. Where 
there are a limited number of locations requiring wide-line
wiring, packaging density is improved.

Wide-line wiring is not limited to the clock wiring 
between the second stage clock drivers 103 and the third 
stage clock drivers 104. It is also possible to install wide 
lines up to the first junctions along the clock wiring between 
the root clock drivers 102 on the one hand and the second
stage clock drivers 103 on the other hand.

As described, a semiconductor integrated circuit device
having the inventive clock layout is subject to significantly
reduced wiring delays, has increased packaging density, and
provides a clock skew-lowering layout involving decreased
clock power dissipation. The device also has an optimally
arranged clock layout for each functional portion of the LSI.

As many apparently different embodiments of this inven- 
tion may be made without departing from the spirit and 
scope thereof, it is to be understood that the invention is not 
limited to the specific embodiments thereof except as 
defined in the appended claims. 

What is claimed is: 

1. A semiconductor integrated circuit device comprising: 

a clock signal generator for outputting clock signals;

a plurality of flip-flops for receiving said clock signals
from said clock signal generator through clock lines; and

a plurality of stages of clock drivers furnished on clock
lines ranging from said clock signal generator to said
flip-flops;

wherein differences between a maximum and a minimum 
length of the clock lines between first stage clock 
drivers and second stage clock drivers are smaller than 
differences between a maximum and a minimum length 
of the clock lines between last stage clock drivers and 
said flip-flops. 

2. A semiconductor integrated circuit device according to 
claim 1, wherein the clock lines between said first stage 
clock drivers and said second stage clock drivers are equal- 
ized in length, and wherein clock lines between said last 
stage clock drivers and said flip-flops have the shortest 
possible lengths. 

3. A semiconductor integrated circuit device comprising:
a clock signal generator for outputting clock signals; and
a plurality of stages of clock drivers furnished on clock
lines coming from said clock signal generator;
wherein, at least either between first stage clock drivers 
and second stage clock drivers, or between second 
stage clock drivers and third stage clock drivers, clock 
lines up to first junctions are each made greater in line 
width than the corresponding clock lines beyond said 
first junctions. 
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(57) ABSTRACT 

A clock tree synthesizer calculates balanced cluster sets of
nodes at a particular level of a clock tree in a circuit description
based on a set of available buffer types. Each balanced
cluster set is tested to see if it meets a design constraint. If
the design constraint is not met for a particular balanced
cluster set, the particular cluster set is removed from
consideration in the clock tree solution. For the cluster sets that
do meet the design constraint, a cost associated with each
cluster set is calculated. A balanced cluster set that has the
lowest cost is selected for the clock tree solution. In one
embodiment, the lowest cost balanced cluster set for one
level in the clock tree forms the nodes for the next higher
level in the clock tree, and the process is repeated at each
level of the clock tree up to a root node. In another
embodiment, the clock tree in the circuit description is
modified with the lowest cost balanced cluster set for each
level of the clock tree solution, wherein each cluster includes
the buffer on which the cluster calculation was based.

20 Claims, 8 Drawing Sheets

METHOD AND APPARATUS FOR CLOCK
TREE SOLUTION SYNTHESIS BASED ON
DESIGN CONSTRAINTS

FIELD OF THE INVENTION

The present invention pertains to the field of integrated
circuit (IC) design. More particularly, this invention relates
to the art of synthesizing clock tree solutions.

BACKGROUND OF THE INVENTION

Since the advent of the integrated circuit (IC), circuit
components have become smaller and smaller. An IC may
include millions of components packed into an incredibly
small package. With each new generation of smaller
integration, more functionality, and therefore more value,
can be derived from ICs. Reliably manufacturing these
highly integrated ICs, however, presents significant design
challenges.

In particular, designing ICs that meet timing constraints
can be particularly difficult. An IC may include tens of
thousands of registers that need to be connected to one or
more clock sources. For each clock "tick", or clock
transition, thousands of registers have to operate in concert.
A complex network is needed to propagate the clock signal
to each of the registers. If the difference in propagation delay
through two different paths in the network is too large or too
small, errors may occur that can cause the entire IC to fail.

Those skilled in the art will be familiar with numerous
processes for synthesizing clock networks, or clock tree
solutions. One of the most common approaches is a binary
clock tree. A binary clock tree often begins by coupling
registers into pairs. Then, pairs of register pairs are coupled
together, pairs of pairs of register pairs are coupled together,
and so on until the clock source, commonly referred to as the
"root" or root node, is reached.

The result is a clock tree having a root and a series of
branches reaching out to the registers. The registers are
commonly referred to as "leaf nodes" on the tree. Between
the root and the leaf nodes there may be several levels of
intermediate nodes where paths branch.

Each register and each path adds a certain amount of load
to the tree. The root usually cannot drive enough current into
the tree to operate the cumulative load. In order to handle
large trees, buffers are inserted into the tree at various
intermediate nodes. Buffers receive a signal from an
upstream driver, such as another buffer or the root node, and
drive the signal to a number of downstream nodes.

A wide variety of approaches have been used to insert
buffers in clock trees. For instance, the number of nodes
coupled to a root may be counted, and one or more buffers
inserted as needed. Then, each buffer can be treated like a
root in a "sub-tree," and nodes can be counted and buffers
inserted to create further sub-trees in a hierarchy that reaches
out to the leaf nodes. Various design constraints can be
tested, and the process repeated with different types of
buffers and tree structures until a suitable solution is found.

As ICs continue to become more complex, having tens of
thousands of registers which may be clocked by several
different source clocks, at several different clock
frequencies, through gated clocks, inverted clocks, etc., the
processing time and expense required to meet continually
more stringent design constraints using known approaches is
becoming increasingly prohibitive.

Therefore, an improved method and apparatus for
synthesizing clock tree solutions is needed.

SUMMARY OF THE INVENTION

The present invention beneficially provides an
method and apparatus for synthesizing clock tree solutions.
At a particular level of a clock tree in a circuit description,
balanced cluster sets of nodes are calculated based on a set
of available buffer types. Each balanced cluster set is tested
to see if it meets a design constraint. If the design constraint
is not met for a particular balanced cluster set, the particular
cluster set is removed from consideration in the clock tree
solution. For the cluster sets that do meet the design
constraint, a cost associated with each cluster set is
calculated. A balanced cluster set that has the lowest cost is
selected for the clock tree solution.

In one embodiment, the lowest cost balanced cluster set
for one level in the clock tree forms the nodes for the next
higher level in the clock tree, and the process is repeated at
each level of the clock tree up to a root node. In another
embodiment, the entire clock tree is tested to see if it meets
a design constraint. In another embodiment, the
clock tree is tested for setup time and/or hold time
violations, and register positions within the clock tree are
changed to eliminate any violations. In another embodiment,
the clock tree in the circuit description is modified with the
lowest cost balanced cluster set for each level of the clock
tree solution, wherein each cluster includes the buffer on
which the cluster calculation was based.

BRIEF DESCRIPTION OF THE DRAWINGS

Examples of the present invention are illustrated in the
accompanying drawings. The accompanying drawings,
however, do not limit the scope of the present invention.
Similar references in the drawings indicate similar elements.

FIG. 1 illustrates one embodiment of an IC design.

FIG. 2 illustrates one embodiment of the IC design with
a clock tree solution.

FIG. 3 illustrates one embodiment of the present invention.

FIG. 4 illustrates a process of one embodiment of the
present invention.

FIG. 5 illustrates a clock tree for which hold time and
setup time violations need to be tested.

FIG. 6 illustrates a timing diagram with a clock skew
between clock signals at two registers.

FIG. 7 illustrates one embodiment of a machine used to
implement the present invention.

FIG. 8 illustrates one embodiment of a machine readable
storage medium to store instructions embodying the present
invention.

DETAILED DESCRIPTION

In the following detailed description, numerous specific
details are set forth in order to provide a thorough
understanding of the present invention. However, those skilled in
the art will understand that the present invention may be
practiced without these specific details, that the present
invention is not limited to the depicted embodiments, and
that the present invention may be practiced in a variety of
alternate embodiments. In other instances, well known
methods, procedures, components, and circuits have not
been described in detail.

Parts of the description will be presented using terminology
commonly employed by those skilled in the art to
convey the substance of their work to others skilled in the
art. Also, parts of the description will be presented in terms
of operations performed through the execution of programming
instructions. As well understood by those skilled in the
art, these operations often take the form of electrical,
magnetic, or optical signals capable of being stored,
transferred, combined, and otherwise manipulated through,
for instance, electrical components.

Various operations will be described as multiple discrete
steps performed in turn in a manner that is helpful in
understanding the present invention. However, the order of
description should not be construed as to imply that these
operations are necessarily performed in the order they are
presented, or even order dependent. Lastly, repeated usage
of the phrase "in one embodiment" does not necessarily
refer to the same embodiment, although it may.

The present invention provides an improved method and
apparatus for synthesizing clock tree solutions in integrated
circuit designs. FIG. 1 illustrates a very simple example of
an integrated circuit (IC) design 110 as it may be defined, for
instance, by a netlist prior to synthesizing a clock tree
solution. Eight registers (R) 120 and four blocks of
combinational logic (L) 130 are placed in the circuit design and
coupled to each other and to input pins (D1, D2, D3, and D4)
and output pins (Q1, Q2, Q3, and Q4) as shown. The netlist
also defines a clock tree. Input clock pin (CLK) 140 is
coupled to the clock pin of each of the registers 120. The
netlist defines all of the connections but does not define how
the connections are made.

CLK 140 is a root node in the clock tree and each register
is a leaf node in the clock tree. If CLK 140 cannot drive
enough current to operate all eight registers 120, one or more
buffers need to be inserted in the clock tree. In a simple
integrated circuit design like the one illustrated in FIG. 1,
buffers could probably be inserted manually, for instance, by
modifying the netlist using any of a number of user
interfaces. FIG. 2 illustrates IC 110 with a modified clock tree
including buffer 210 to drive the clock signal to each of the
registers 120.

Manually modifying clock trees becomes much more
difficult when circuits become more complex and design
constraints become more stringent. For instance, timing
constraints for IC 110 may include minimum and maximum
propagation delay from CLK 140 to the registers 120,
minimum and maximum clock transition time at each
register 120, minimum and maximum delay through each logic
block 130, and required setup and hold times for each
register 120. The timing constraints may be very stringent,
requiring a balanced solution with very little deviation in
delay from one clock path to the next. Additional design
constraints may state that CLK 140 can only drive one
buffer, buffers that can be used in IC 110 can only drive up
to three registers, each buffer introduces a certain amount of
propagation delay and increases transition time by a certain
amount, and the area available on IC 110 to add buffers is
extremely limited, leaving room for no more than four
buffers. With these design constraints, manually synthesizing
a clock tree solution for even the simple IC design shown
in FIG. 1 is no trivial matter. For today's highly integrated
circuits, often including tens of thousands of registers,
manually synthesizing a clock tree solution is virtually
impossible.

The present invention beneficially synthesizes clock tree
solutions using design constraints and cost analysis to insert
buffers and select node clusters that provide superior,
balanced clock trees. A clock tree solution can then be added to
a circuit description that is later used to route the
connections in the IC design.

FIG. 3 illustrates one embodiment of clock tree synthesizer
320 coupled to electronic design automation (EDA)
system 310. Except for the teachings of the present
invention, EDA system 310 represents any of a broad
category of EDA systems. For instance, EDA system 310
may include capabilities for generating a gate-level IC
design from hardware description language (HDL) files,
including provision of a timing budget, generation of a floor
plan, synthesis of gates, placement of gates, and routing of
transmission paths.

EDA system 310 provides clock tree synthesizer 320 with
input 330. Input 330 includes a circuit description in any of
a number of formats. In one embodiment, input 330 includes
component placement, timing constraints, and a set of
available buffers. In response to input 330, clock tree
synthesizer 320 provides output 340 which defines a clock tree
solution for the circuit description. In one embodiment, EDA
system 310 uses the clock tree solution to route the clock
tree in the IC design.

FIG. 4 demonstrates one embodiment of clock tree
synthesizer 320. In general terms, clock tree synthesizer 320
groups clock tree nodes on a level-by-level basis starting
with the level of nodes furthest from the root node. At each
level, clusters of nodes are calculated in various ways
depending on available buffer types, and a best cluster set for
each level of the clock tree solution is selected based on a
cost analysis. The embodiment illustrated in FIG. 4 includes
a number of implementation specific details and various
alternate embodiments.

In block 410, input data is received. In the illustrated
embodiment, the input data includes component placement
and timing constraints for a circuit design. Component
placement includes coordinate locations of clock pins in one
or more clock trees in the circuit design, such as source clock
locations and the locations of clock inputs on each register,
and a definition of component connections. Component
placement may be in the form of a netlist.

Timing constraints may include data such as minimum
and maximum propagation delay from the source clock to a
clock pin of any register, minimum and maximum clock
transition time at any register, hold time and setup time
requirements for each type of register, and propagation
delays through components such as registers, combinational
logic, buffers, inverters, clock dividers, clock multipliers,
etc. Timing constraints may also include propagation
constants for calculating propagation delays through lengths of
transmission paths.

In various embodiments, input data may also include
design constraints defining one or more source clock wave
forms, available area for inserting buffers and/or inverters,
available types of buffers and/or inverters including maximum
load for each buffer/inverter and area required to insert
each buffer/inverter, and available layers in the IC design for
clock tree routing.

Input data may also define pre-designed partial trees, also
called sub-trees or macro-cells. The partial trees can be
treated as a single terminal node from the perspective of the
clock tree synthesizer. The input data for partial trees may
include maximum and minimum propagation delay from the
source clock up to the root of the partial tree, as well as the
load the partial tree places on the clock tree.

Certain aspects of the input data can be user defined. For
instance, in one embodiment, a default set of available
buffers will be used unless a user defined set of buffers is
included in the input data. The input data may also indicate
certain user defined nodes that should be ignored or are to be




US 6,367,060 B1



treated as leaf nodes or terminal nodes, as in the case of a 
partial tree discussed above. Similarly, where two clock 
trees overlap, a user may be required to define certain nodes 
to be terminal nodes in order to separate the clock trees as 
viewed from the perspective of the clock tree synthesizer. 

In block 420, leaf nodes for a given clock tree are 
identified. For instance, a root node can be selected from a 
netlist and all of the registers coupled to the root node as 
defined by the netlist can be identified. The set of identified 
leaf nodes may include tens of thousands of nodes. The set
of leaf nodes comprises the outermost level of the clock tree.

Partial trees, as mentioned above, may be treated as 
terminal nodes in a clock tree. The timing constraints at the 
root of a partial tree, however, are likely to be different from 
the timing constraints at registers. For instance, the propa- 
gation delay from the clock tree root to each register in the 
clock tree must fall within a specified range, but the propa- 
gation delay from the clock tree root to the root of a partial 
tree may not fall within the same specified range. In which 
case, the partial tree needs to be given special consideration 
during clock tree synthesis. Partial trees, as well as other 
types of terminal nodes that are not leaf nodes, will be 
discussed more fully below. 

Continuing on with FIG. 4, in block 430, balanced cluster 
sets of the leaf nodes are calculated based on the available 
types of buffers. For instance, in one embodiment, the set of 
available buffers includes five types of buffers that can drive 
loads up to 15, 10, 5, 3, and 2 registers respectively. In which 
case, if there are 150 thousand registers at the leaf node level 
in the clock tree, the registers could be clustered into a first 
balanced set of 10 thousand clusters of nodes driven by 10 
thousand 15-output buffers, or a balanced set of 15 thousand 
clusters driven by 15 thousand 10-output buffers, or a 
balanced set of 30 thousand clusters driven by 30 thousand 
5-output buffers, and so on for each buffer type.
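
The cluster counts quoted above follow from dividing the node count by each buffer's maximum fan-out and rounding up. A minimal Python sketch, in which the function name and the use of a plain list of fan-outs are illustrative assumptions:

```python
def cluster_counts(num_nodes, buffer_fanouts):
    """Map each buffer fan-out to the number of clusters (buffers) it needs.

    Uses integer ceiling division so that a final, partly filled cluster
    is still counted.
    """
    return {fanout: -(-num_nodes // fanout) for fanout in buffer_fanouts}


# 150 thousand registers and the five buffer types from the example.
counts = cluster_counts(150_000, [15, 10, 5, 3, 2])
print(counts)
```

Each entry corresponds to one candidate balanced cluster set for the level.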

Each buffer type has associated with it a certain amount 
of propagation delay and each buffer type impacts clock 
transition times at the registers by a certain amount. Each set 
of calculated clusters is "balanced" in that the same buffer 
type is used for the entire level of leaf nodes so that the 
timing constraints at each register are similarly impacted. The
goal of the ideal clock tree, of course, is for each register to 
receive the clock signal at exactly the same time so that there 
is no clock skew between registers. For instance, if the 
15-output buffer has a 5 nanosecond propagation delay, then 
the balanced set of registers all clustered using the 15-output 
buffer will all experience a 5 nanosecond delay. 

Realistically, the total number of nodes will not be evenly 
divided by the number of nodes that can be driven by a 
buffer type. Buffers, however, can drive fewer than the 
maximum number of nodes. In which case, one buffer type 
can be used for more than one set of clusters. For instance, 
if there are 145 thousand registers, a cluster set could include 
9667 clusters driven by 9667 15-output buffers, all driving 
15 registers except one buffer which drives 10 registers. Of 
course, propagation delay through a buffer may depend on 
the load, i.e. the number of nodes. In which case, the one 
15-output buffer driving only 10 registers may have a 
significantly shorter propagation delay or transition time, 
potentially creating clock skew between registers. In which 
case, in one embodiment, a "balanced" set of clusters is also 
one that attempts to evenly distribute the number of nodes 
over the number of buffers. For instance, 145 thousand 
registers could be driven by 9662 buffers driving 15 registers 
each and 5 buffers driving 14 registers each. The difference
in propagation delay between driving 14 and 15 registers may
be negligible.
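
The even distribution described above can be sketched in a few lines of Python. The function name and the returned mapping (cluster size to number of clusters) are illustrative assumptions; the arithmetic reproduces the 145 thousand register example.

```python
def balanced_cluster_sizes(num_nodes, max_fanout):
    """Spread nodes as evenly as possible over the minimum number of buffers.

    Returns a dict mapping cluster size to the number of clusters of that size.
    """
    num_clusters = -(-num_nodes // max_fanout)     # ceiling division
    base, extra = divmod(num_nodes, num_clusters)  # `extra` clusters get one more node
    sizes = {}
    if extra:
        sizes[base + 1] = extra
    sizes[base] = num_clusters - extra
    return sizes


# 9667 buffers in total: 9662 drive 15 registers and 5 drive 14 registers.
print(balanced_cluster_sizes(145_000, 15))
```

When the node count divides evenly, as with 150 thousand registers and 15-output buffers, every cluster has the same size.
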
In any event, a potentially large number of possible
balanced cluster sets are calculated. Any number of tech- 
niques can be used to derive the factorization of the number 
of nodes by the set of available buffers. 

In block 440, each calculated cluster set is tested against 
a design constraint. In the illustrated embodiment, the tested 
design constraint is the clock transition time at the registers. 
For instance, the timing constraints may require that the 
clock signal at each register must transition from high to 
low, or low to high, in a minimum of 1 nanosecond and a 
maximum of 2 nanoseconds. Certain buffer types, driving 
certain numbers of nodes, may not meet the timing con- 
straint. 

Timing constraints are often process dependent, meaning 
that, in order for the type of register being used in a 
particular IC design to have a known state, the constraints 
must be met. If the constraints are not met, errors may occur. 
In which case, in block 445, any cluster set that does not 
meet the timing constraint is removed from consideration. In
the illustrated embodiment, a cluster set is effectively
removed from consideration by setting a cost associated 
with the cluster set to a large value so that it will not be 
selected in the costing analysis discussed below for block 
455. 
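
Blocks 440 and 445 can be sketched as a filter that assigns a very large cost to any cluster set whose clock transition time falls outside the allowed window. In this minimal Python sketch the penalty value, the function name and the sample transition times are all hypothetical; only the 1 to 2 nanosecond window comes from the example above.

```python
PENALTY_COST = 1e30  # hypothetical "large value" that can never win the cost comparison


def penalize_violations(transition_ns_by_set, min_ns=1.0, max_ns=2.0):
    """Return a cost table in which violating cluster sets carry PENALTY_COST.

    Cluster sets that pass are mapped to None, meaning their real cost is
    still to be calculated in block 450.
    """
    costs = {}
    for name, transition_ns in transition_ns_by_set.items():
        if min_ns <= transition_ns <= max_ns:
            costs[name] = None
        else:
            costs[name] = PENALTY_COST
    return costs


# Invented transition times: the 15-output set is too slow, the 10-output set passes.
print(penalize_violations({"15-output": 2.4, "10-output": 1.6}))
```

Keeping the violating sets in the table, rather than deleting them, mirrors the illustrated embodiment, where removal is achieved through the cost value alone.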

In block 450, a cost is calculated for cluster sets that met 
the timing constraint tested in block 440. In one 
embodiment, cost is equal to the cumulative area necessary 
to insert the set of buffers plus a cost factor times the 
propagation delay for the buffer type. That is: 

COST = AREA + a (DELAY)

Different buffer types require different amounts of area. The 
area for each buffer type may be defined in the input data or 
retrieved from a default library. In general, buffers which 
drive larger maximum loads require more area. Also, buffers 
tend to have longer propagation delays for larger loads. A 
15-output buffer used to drive only 10 nodes may actually 
have shorter delay, but require more area, than a 10-output 
buffer used to drive the same 10 nodes. That is, depending 
on which component of the cost equation is emphasized 
based on the value of a, certain cluster sets may have lower 
associated cost. The cost calculation and the cost factor a
are discussed more fully below. In alternate embodiments,
any number of cost equations can be used. 
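
The effect of the cost factor a can be seen with a small Python sketch. The area and delay figures are invented for illustration and are not taken from the specification; only the form of the cost equation is.

```python
def cluster_set_cost(area, delay, a):
    """Cost of one cluster set: cumulative buffer area plus a times buffer delay."""
    return area + a * delay


# Hypothetical figures for 1000 clusters of 10 nodes each:
# a larger 15-output buffer (more area, shorter delay) versus a 10-output buffer.
big = {"area": 3000.0, "delay": 4.0}
small = {"area": 2000.0, "delay": 5.0}

for a in (0.0, 2000.0):
    cost_big = cluster_set_cost(big["area"], big["delay"], a)
    cost_small = cluster_set_cost(small["area"], small["delay"], a)
    winner = "15-output" if cost_big < cost_small else "10-output"
    print(a, cost_big, cost_small, winner)
```

With a equal to zero only area matters and the smaller buffer wins; with a large a the delay term dominates and the larger buffer wins.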

In block 455, a cluster set having the lowest calculated 
cost is selected for the clock tree solution. That is, for the 
given cost equation with a given value for a, the best cluster 
set for the leaf node level is selected. Since the cluster sets 
that did not meet the timing constraint in block 440 were set 
to large values, such as orders of magnitude greater than any 
reasonable cost value anticipated using the given cost 
equation, those cluster sets are effectively removed from 
consideration. 

In alternate embodiments, cost is calculated for all cluster 
sets, not just those that meet the design constraint. In which 
case, the design constraint may be tested, and cluster sets 
that do not meet the design constraint removed from 
consideration, at any point prior to selecting the lowest cost 
cluster set. 

In block 460, the clock tree synthesizer determines 
whether or not the root node has been reached. The root node 
has been reached if all of the nodes in the current level (the 
leaf node level at this point in the process) are coupled to the 
root node. Only in very small IC designs will the leaf node 
level couple directly to the root node. In the example from 
above, including 150 thousand registers, the leaf node level 
is many levels removed from the root node. 

In block 465, if the root node has not been reached, the
nodes for the next level of the clock tree are identified. The
buffers forming the clusters for the previous level comprise
the set of nodes for the next level of the clock tree. For
instance, if the previous level included 150 thousand leaf
nodes, and the lowest cost cluster set was 10 thousand
15-output buffers, the nodes for the next level are the 10
thousand buffers. In which case, the process returns to block
430, balanced cluster sets are calculated for 10 thousand
nodes, and the best cluster set for the 10 thousand-node level
of the clock tree solution is selected based on the cost
calculation. Levels of the clock tree are built one on top of
the other until the root node is reached in block 460.
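
The level-by-level construction can be sketched as a simple loop. This is a simplified illustration under stated assumptions: `pick_fanout` is a hypothetical callback standing in for the whole cluster-set calculation and cost selection of blocks 430 through 455, and the loop treats "a single remaining buffer" as having reached the root.

```python
import math


def build_levels(num_leaf_nodes, pick_fanout):
    """Return the node count at each level, from the leaves up to the root."""
    levels = [num_leaf_nodes]
    nodes = num_leaf_nodes
    while nodes > 1:
        # Outputs per buffer for the type chosen at this level (must be >= 2).
        fanout = pick_fanout(nodes)
        # One buffer per cluster; these buffers are the next level's nodes.
        nodes = math.ceil(nodes / fanout)
        levels.append(nodes)
    return levels


# The example from the text: 150 thousand leaf nodes, 15-output buffers.
levels = build_levels(150_000, lambda n: 15)
```

Here the second entry of `levels` is 10,000, matching the 10 thousand buffers in the example, and the list ends with a single root driver.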

The result is a multi-level clock tree solution. Each level
is "balanced" in that each level is driven by one type of
buffer (or inverter as the case may be). Within each level, in
certain embodiments, clusters are also "balanced" in that
each cluster includes roughly equal numbers of nodes. The
clock signal is propagated to each register through the same
number of the same types of buffers.

In block 470, additional design constraints are tested. In 
one embodiment, only one additional design constraint is 
tested — the cumulative propagation delay from the root 
node to the registers. The propagation delay to every leaf 
node must be between a minimum and maximum value. 
Alternate embodiments may test the delay and the cumula- 
tive area, or any number of other design constraints or 
combinations of constraints. If any of the constraints are not 
met, the clock synthesizer proceeds to block 475. 

In one embodiment the values for the cost equation are 
cumulated from one level to the next so that, for instance, 
cumulative delay and area constraints can be tested in block 
470 without recalculating the values. 

In block 475, if the additional design constraints are not
met, the cost equation is adjusted. In the illustrated
embodiment, the cost equation is adjusted by changing the
value of the cost factor a. Then the process returns to block
420 and begins to build a new clock tree solution starting at
the leaf node level. Changing the cost factor changes the
emphasis of the cost equation so that different buffer types
and cluster structures are likely to be selected.

In one embodiment, the cost factor is adjusted from one
iteration of building a clock tree solution to the next using
a binary search until the design constraints are met or no
acceptable solution can be found. For instance, a range of
acceptable cost factor values may go from zero up to some
maximum value. For a first iteration through the process, the
cost factor can be set to zero. In which case, recalling the
cost equation,

COST = AREA + a(DELAY),

if the cost factor a is zero, cost will be equal to area. In other
words, the first iteration will select the clock tree solution
that has the lowest required area without any consideration
whatsoever for delay. In one embodiment, at the end of the
first iteration, the area and delay constraints can be tested. If
the area constraint is not met, then the clock tree may not be
physically possible, and the process may end.

For the delay constraint, the propagation delay must be
between the minimum and maximum propagation delay. In
general, larger cumulative area often translates into shorter
overall propagation delay, and longer propagation delay
often translates into smaller area. If the propagation delay is
longer than the maximum allowable delay, the cost factor
should be increased so delay is more emphasized in the cost
equation, likely resulting in a larger cumulative area. In the
binary search, starting with a zero cost factor in the first
iteration, the cost factor can be set to a value halfway
between zero and the maximum value for the next iteration.

At the end of the next iteration, if the propagation delay
is less than the minimum allowable delay, the cost factor
should be decreased from the midpoint in the binary search
range in order to place more emphasis on the area element
of the cost equation, likely resulting in a longer propagation
delay. Conversely, if the propagation delay is still more than
the maximum allowable delay, the cost factor should be
increased again. In which case, the cost factor should
either be decreased to halfway between the current value and
zero, or increased to halfway between the current value and
the maximum value. With each subsequent iteration, the
binary search range gets smaller and smaller because the
cost value is either increased by half from the previous value
to reduce delay or decreased by half from the previous value
to increase delay until an acceptable clock tree solution is
found.

Any number of alternate search techniques can be used in
alternate embodiments to adjust the cost equation from one
iteration to the next. In one alternate embodiment, the design
constraints are tested after each level is added to the clock
tree solution rather than waiting until the end of a complete
iteration.

In block 470, if the tested design constraints are met, the
process proceeds to block 480. In block 480, the clock tree
is tested for setup time and/or hold time violations. In block
490, register positions are changed in the clock tree as
needed in order to correct any setup time and/or hold time
violations. The functions of blocks 480 and 490 are
discussed below in more detail with respect to FIGS. 5 and 6.

In block 495, the clock tree synthesizer outputs the
acceptable clock tree solution.

As discussed above, terminal nodes which are not leaf
nodes (nodes at registers) require special consideration. The
design constraints associated with terminal nodes are often
not the same as design constraints associated with leaf
nodes. For instance, a terminal node that is a root node for
a partial tree, a gated clock, a divided clock, etc., is likely to
have different propagation delay constraints. If the
acceptable range of propagation delay for a terminal register is
longer than the acceptable range for registers, any number of
techniques can be used to add delay to the terminal node so
that the terminal node can fit into the clock tree at the leaf
node level and be included in block 420 for the first iteration
of the process illustrated in FIG. 4.
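
The binary search on the cost factor described for block 475 can be sketched as below. This is an illustration under assumptions, not the patented implementation: `build_tree` is a hypothetical callback that builds a complete clock tree solution for a given cost factor and returns its cumulative propagation delay, and the iteration cap `max_iters` is an added safeguard that the text does not mention.

```python
def search_cost_factor(build_tree, min_delay, max_delay, a_max, max_iters=32):
    """Binary-search the cost factor a over the range [0, a_max]."""
    lo, hi = 0.0, a_max
    a = 0.0  # first iteration: cost equals area only
    for _ in range(max_iters):
        delay = build_tree(a)
        if min_delay <= delay <= max_delay:
            return a, delay  # acceptable clock tree solution found
        if delay > max_delay:
            lo = a  # too slow: raise a to emphasize delay
        else:
            hi = a  # too fast: lower a to emphasize area
        a = (lo + hi) / 2.0  # midpoint of the remaining search range
    return None  # no acceptable solution found in the range


# Toy delay model (an assumption): delay falls as the cost factor grows.
found = search_cost_factor(lambda a: 10.0 - a, 4.0, 5.0, a_max=8.0)

# A model whose delay never responds to a yields no solution.
not_found = search_cost_factor(lambda a: 10.0, 1.0, 2.0, a_max=8.0)
```

In the toy run the search visits a = 0, 4 and 6 and stops at a = 6, where the modeled delay of 4.0 falls inside the allowed window.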

Fitting terminal nodes into the clock tree is more difficult 
if the range of acceptable delay values is narrower from 
minimum value to maximum value than for registers, or the 
maximum allowable delay for a terminal node is shorter than 
the minimum allowable delay for the register. In one 
embodiment, both of these cases are addressed by cumulat- 
ing delay for each level of buffers as they are added to the 
clock tree. When the cumulative delay of the levels of
buffers is approximately equal to the difference between the
terminal node delay and the register delay, the terminal node
is included in the set of nodes used for the next level of the 
clock tree solution. That is, in block 465 of FIG. 4, identi- 
fying nodes for the next level of the clock tree solution 
includes comparing cumulative delay of the current level of 
the clock tree with the difference between terminal node 
delay constraints and register delay constraints, and adding 
terminal nodes to the next level if the constraints match, or 
match to within a particular deviation range. 
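
The comparison added to block 465 can be sketched as follows. The function name, the dictionary of terminal nodes, and the numeric delays are hypothetical. The sketch also assumes each node's delay constraint can be summarized by a single target value, which is a simplification of the minimum and maximum ranges discussed above.

```python
def terminals_ready(terminals, register_delay, cumulative_delay, tolerance):
    """Return terminal nodes whose delay gap matches the delay built so far.

    terminals maps a node name to its target delay. The gap is the register
    delay minus the terminal delay, i.e. the delay the terminal must skip.
    """
    ready = []
    for name, terminal_delay in terminals.items():
        gap = register_delay - terminal_delay
        if abs(cumulative_delay - gap) <= tolerance:
            ready.append(name)  # join the node set for the next level
    return ready


ready = terminals_ready(
    {"gated_clk": 3.0, "div_clk": 1.0},
    register_delay=5.0,
    cumulative_delay=2.0,
    tolerance=0.1,
)
```

With 2.0 units of buffer delay accumulated, only "gated_clk" (gap 2.0) is added at this level; "div_clk" (gap 4.0) waits until more levels have been built.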

FIG. 5 illustrates a simple example of a register for which 
setup time and hold time violations need to be tested as 
mentioned above in FIG. 4, block 480. Clock source 510 is 
coupled to clock tree 515. Clock tree 515 is "balanced".
Cluster sizes based on buffers 517 are equal. Each cluster
within a level is driven by the same kind of buffer, buffers
517. Each register is separated from clock source 510 by the
same number and the same kind of buffers, one buffer 516
and one buffer 517. Ideally, each register receives the clock
signal at exactly the same time. Realistically, however, slight
process variations from buffer to buffer and slightly different
path lengths can result in slight variations in cumulative
propagation delays experienced by two different registers.
The difference in delays is called "skew".

When register 520 receives a rising clock edge at clock
input 525, a value at data input 540 is latched in and passed
to output 545. The value passes through combinational logic
526 and a modified value arrives at the data input 550 of
register 530 after a certain amount of delay. The delay from
the rising edge at clock input 525 to a value arriving at data
input 550 is somewhere between the minimum delay 556
and the maximum delay 557. When the circuit design is
operating properly, the value at data input 550 will be
clocked into register 530 at the next rising clock edge. The
registers are said to be a "dependent" pair of registers, in
which register 520 is independent and register 530 is
dependent.

FIG. 6 illustrates one embodiment of a timing diagram for
the clock signal received at clock inputs 525 and 535 from
FIG. 5. The difference in the propagation delays from the
clock source to the respective registers results in skew 661.
That is, the clock signal at register 530 is slightly behind the
clock signal at register 520. Design constraints for the
registers include hold time 662 and setup time 663. If a value
at a data input for a register changes during a setup time
before a clock edge or the hold time after a clock edge, the
value that appears at the output will be unknown. That is, if
the hold time or setup time design constraints are violated,
the state of the IC will be unknown. Therefore, the value at
data input 550 can only safely change during the period of
time labeled 664 in FIG. 6 for a given clock period 660.

In order to prevent hold time or setup time violations, the
minimum delay time 556 in FIG. 5, which is measured from
the rising clock edge 670 in FIG. 6, must be more than skew
661 plus hold time 662 so that any data change at register
530 happens after hold time 662 and during period 664.
Similarly, the maximum delay 557, measured from clock
edge 670, must be less than the clock period 660 plus skew
661 minus setup time 663 so that any data change happens
before setup time 663 and during period 664. Solving the
equations for skew 661:

Max delay 557 - Period 660 + setup 663 < skew 661 < min delay 556 - hold 662.

Hold time or setup time violations can be detected using
this condition. Maximum delay 557 minus period 660 plus
setup time 663 is usually negative, and minimum delay 556
minus hold time 662 is usually positive. In which case, setup
time and hold time violations are usually eliminated by
making the magnitude of the skew as small as possible, zero
or nearly zero. Any number of additional methods could also
be used to test for violations.

If violations are detected, one embodiment of the present
invention attempts to correct the violations by changing
positions of registers in the clock tree. The placement of the
registers in the IC design is not altered. Instead, the points
at which the registers are coupled to the clock tree are
changed. For instance, skew is partially a result of process
dependent variations between buffers. That is, in the
example illustrated in FIG. 5, propagation delay through
buffer 517 coupled to register 520 is slightly different from
the propagation delay through buffer 517 coupled to register
530. So, to reduce the skew, both registers should be coupled
to the same buffer 517 so that they both experience the same
delay through the same buffer.

In general terms, in order to reduce skew, registers can be
swapped from cluster to cluster in the clock tree so that a
dependent pair of registers share as many common buffers as
possible. Basically, this means that a dependent pair of
registers should be clustered as low as possible in the clock
tree, as near to the leaf node level as possible.

Another approach to reduce hold time and setup time
violations within a cluster is to change the order in which a
pair of dependent registers are clustered. That is, the
independent register should be coupled to the buffer immediately
followed by the dependent register. Generally, this will
result in a shortest possible variation between transmission
paths.

FIG. 7 is intended to represent a broad category of
computer systems. In FIG. 7, processor 710 includes one or
more microprocessors. Processor 710 is coupled to
temporary memory 760 by high speed bus 770. High speed bus 770
is coupled to Input/Output bus 750 by bus bridge 780.
Permanent memory 720 and Input/Output devices, including
display device 740, keyboard 730, and mouse 790, are also
coupled to Input/Output bus 750. In certain embodiments,
one or more components may be eliminated, combined,
and/or rearranged. A number of additional components may
also be coupled to either bus 750 and/or 770 including, but
not limited to, another bus bridge to another bus, one or
more disk drives, a network interface, additional audio/video
interfaces, additional memory units, additional processor
units, etc.

Clock tree synthesizer 320, as shown in FIG. 3, can be
executed by processor 710 as a series or sequence of
machine readable instructions or function calls stored, for
instance, in permanent memory 720 or temporary memory
760. Alternately, as shown in FIG. 8, machine executable
instructions 820, representing the function of clock tree
synthesizer 320, could be stored on distribution storage
medium 810, such as a CD ROM, a digital video or versatile
disk (DVD), or a magnetic storage medium like a floppy
disk or tape. The instructions could also be downloaded
from a local or remote server.

Alternately, the present invention could be implemented
in any number of additional hardware machines. For
instance, one or more ASICs (application specific integrated
circuits) could be endowed with some or all of the
functionality of clock tree synthesizer 320, and inserted into
system 700 of FIG. 7 as separate components, or combined
with one or more other components.

Thus, an improved method and apparatus for synthesizing
clock tree solutions has been described. Whereas many
alterations and modifications of the present invention will be
comprehended by a person skilled in the art after having read
the foregoing description, it is to be understood that the
particular embodiments shown and described by way of
illustration are in no way intended to be considered limiting.
Therefore, references to details of particular embodiments
are not intended to limit the scope of the claims.

What is claimed is:

1. A method comprising:

calculating a plurality of balanced cluster sets for a
plurality of nodes comprising a first level of a clock tree
in a circuit description for consideration as part of a
clock tree solution, each balanced cluster set based on
one of a set of available buffer types;

testing each of the balanced cluster sets to determine if a
first design constraint is met;

removing each of the balanced cluster sets that do not
meet the first design constraint from consideration in
the clock tree solution;

calculating a cost associated with each of the balanced
cluster sets that do meet the first design constraint using
a cost formula; and

selecting a lowest cost balanced cluster set for the clock
tree solution.

2. The method of claim 1 wherein the lowest cost bal- 
anced cluster set comprises a plurality of nodes comprising 
a next level of the clock tree, the method further comprising: 

iteratively repeating the calculating the plurality of
balanced cluster sets, testing, removing, calculating the
cost, and selecting for each next level of the clock tree 
up to a root level of the clock tree. 

3. The method of claim 1 further comprising:

testing the clock tree to determine if at least one additional
design constraint is met;
adjusting a cost factor of the cost formula if the at least 
one additional design constraint is not met; and 

repeating the method if the at least one additional design 
constraint is not met beginning with a leaf node level 
for the first level. 

4. The method of claim 3 wherein the cost formula 
comprises an area component and a delay component.

5. The method of claim 4 wherein adjusting the cost factor 
comprises: 

selecting a next value in a binary search of a range of cost 
factor values, the range of cost factor values to define 
a relative importance of the delay component in the
cost formula. 

6. The method of claim 4 wherein the binary search begins 
with a cost factor value that minimizes the relative impor- 
tance of the delay component in the cost formula. 

7. The method of claim 3 wherein the at least one
additional design constraint comprises a minimum and 
maximum clock delay. 

8. The method of claim 1 further comprising: 

testing a pair of dependent registers at a leaf node level for 
setup time violations and/or hold time violations,
wherein the pair of dependent registers comprises an 
independent register and a dependent register; and 

changing positions of the independent register and the 
dependent register in the clock tree until setup time 
violations and/or hold time violations are eliminated.

9. The method of claim 8 wherein changing positions of 
the independent register and the dependent register com- 
prises at least one of: 

positioning the independent register and the dependent
register at a same low level in the clock tree; and 

coupling the independent register and the dependent reg- 
ister in a cluster in an order of the independent register 
followed by the dependent register. 

10. The method of claim 1 wherein the first design
constraint comprises a minimum and maximum clock tran- 
sition time. 

11. The method of claim 1 wherein removing balanced
cluster sets from consideration comprises:

setting a cost associated with each of the balanced cluster
sets that do not meet the first design constraint to a large
value.

12. The method of claim 1 further comprising:

modifying the clock tree in the circuit description with the
lowest cost balanced cluster set selected for the clock
tree solution, each cluster including one buffer of the
type of buffer on which the calculating was based.

13. The method of claim 1 wherein the set of available 
buffer types is user defined. 

14. The method of claim 2 wherein a number of levels in 
the clock tree is user defined, and wherein iteratively repeat- 
ing occurs based on the number of user defined levels. 

15. The method of claim 2 further comprising: 
comparing a cumulative delay of levels of the clock tree
solution with a difference between a delay constraint
for a terminal node and a delay constraint for a leaf 
node; and 

including the terminal node in the plurality of nodes 
comprising the next level of the clock tree based on the 
comparing. 

16. The method of claim 1 wherein the plurality of nodes 
comprising the first level of the clock tree include a terminal 
node. 

17. The method of claim 16 wherein the terminal node 
includes one of a root of a partial tree, a user defined node, 
an input to a logic block, an input to a multiplier, and an 
input to a divider. 

18. The method of claim 1 wherein the set of available 
buffer types includes inverters. 

19. An article of manufacture comprising: 
a machine readable storage medium; 

the machine readable storage medium having stored 
thereon machine executable instructions, the execution 
of the machine executable instructions to implement a 
method comprising: 

calculating a plurality of balanced cluster sets for a 
plurality of nodes comprising a first level of a clock 
tree in a circuit description for consideration as part 
of a clock tree solution, each balanced cluster set 
based on one of a set of available buffer types; 

testing each of the balanced cluster sets to determine if 
a first design constraint is met; 

removing each of the balanced cluster sets that do not 
meet the first design constraint from consideration in 
the clock tree solution; 

calculating a cost associated with each of the balanced 
cluster sets that do meet the first design constraint 
using a cost formula; and 

selecting a lowest cost balanced cluster set for the clock 
tree solution. 

20. An apparatus comprising: 

first circuitry to calculate a plurality of balanced cluster 
sets for a plurality of nodes comprising a first level of 
a clock tree in a circuit description for consideration as 
part of a clock tree solution, each balanced cluster set 
based on one of a set of available buffer types; 

second circuitry to test each of the balanced cluster sets to 
determine if a first design constraint is met; 

third circuitry to remove each of the balanced cluster sets 
that do not meet the first design constraint from con- 
sideration in the clock tree solution; 

fourth circuitry to calculate a cost associated with each of 
the balanced cluster sets that do meet the first design 
constraint using a cost formula; and 

fifth circuitry to select a lowest cost balanced cluster set 
for the clock tree solution. 

***** 
[57] ABSTRACT

A microelectronic circuit includes a plurality of circuitry 
blocks and sub-blocks, a clock driver, an electrical intercon- 
nect that directly connects the clock driver to the sub-blocks, 
and balanced clock-tree distribution systems provided 
between the electrical interconnect and circuitry in the 
sub-blocks respectively. A method of producing a hierarchial
clock distribution system for the circuit includes determin- 
ing clock skews between the clock driver and the sub-blocks 
respectively. Delay buffers are selected from a predeter- 
mined set of delay buffers having the same physical size and 
different delays, with the delay buffers being selected to 
provide equal clock skews between the clock driver and the
distribution systems respectively. Each delay buffer includes 
a delay line, and a number of loading elements that are 
connected to the delay line, with the number of loading 
elements being selected to provide the required clock delay 
for the respective sub-block. 

16 Claims, 3 Drawing Sheets 
HIERARCHIAL CLOCK DISTRIBUTION
SYSTEM AND METHOD

This application is a continuation of Ser. No. 08/482,763
filed Jun. 7, 1995, now U.S. Pat. No. 5,570,045.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to the art of
microelectronic integrated circuits, and more specifically to
a hierarchial clock distribution system and method for
optimally equalizing clock skew to circuitry in blocks of an
integrated circuit.

2. Description of the Related Art

A large microelectronic integrated circuit, such as an
Application Specific Integrated Circuit (ASIC), generally
includes a number of circuitry blocks or modules that can
themselves include sub-blocks in a hierarchial arrangement.
The circuitry is driven by clock pulses that are applied
through an input clock driver and distributed via
interconnect wiring to the various blocks of the circuit and other
devices that are not included in the blocks.

In order for the circuit to function properly, the clock
pulses must arrive at each clocked circuit element at the
same time. However, the lengths of the wiring that conduct
the clock to the different blocks will generally be
different. Since the length of time required for an electrical
signal to propagate through a wire is proportional to the
length of the wire, the clock pulses will arrive at the blocks
at different times.

In addition, different types of buffers may be used in each
block, creating differences between the clock pulse arrival to
the clocked circuit elements in the blocks. The phase or
timing difference between the clock pulse arrival time to any
two clocked circuit elements in a microelectronic integrated
circuit is called skew.

It is therefore necessary to provide means for minimizing
the skew in the circuit and restore synchronism to the
operation of the circuit. This function can be provided by
inserting delay buffers in the circuit having different delays
to compensate for the different values of delay at the
individual blocks.

[illegible] MICROPROCESSOR", issued Apr. 26, 1994 to
B. Ahuja. A simplified diagram illustrating this system is
presented in FIG. 1.

Input clock pulses CLOCK are applied to a plurality of
delay buffers 10, 12 and 14, which are connected through
lines 16, 18 and 20 to circuit blocks 22, 24 and 26
respectively. The buffers 10, 12 and 14 delay the clock pulses by
different lengths of time to compensate for the different
lengths of the lines 16, 18 and 20 such that the clock pulses
CLOCK arrive at the blocks 22, 24 and 26 simultaneously.

An extension of this concept to a hierarchial structure of
circuitry blocks is disclosed in U.S. Pat. No. 5,258,660,
entitled "SKEW-COMPENSATED CLOCK DISTRIBUTION
SYSTEM", issued Nov. 2, 1993 to S. Nelson et al. A
simplified diagram of this system is presented in FIG. 2.

The system comprises a plurality of fanout circuits 30, 32,
34, 36 and 38, each including an input delay buffer and a
plurality of output delay buffers which are collectively
designated by the reference character B. As illustrated, each
fanout circuit has three outputs, although the actual number
of outputs is not relevant.

The output delay buffers B of the fanout circuit 30 are
connected to the input delay buffers B of the fanout circuits
32, 34 and 36, which collectively produce nine outputs. Each
output of the fanout circuits 32, 34 and 36 is connected to an
input of another fanout circuit to provide a total of 27
outputs, and the hierarchial chain can continue to as many
levels as desired. Only one fanout circuit 38 is illustrated as
being connected to an output buffer B of the fanout circuit
32 [illegible].

The output buffers of the fanout circuits are connected to
[illegible] blocks of a microelectronic integrated
circuit [illegible]
described above with reference to FIG. 1. Thus, the skew
from the clock input to the blocks can be equalized.

However, it is difficult to predetermine and accurately
[illegible] buffers in the [illegible]
of FIG. 2 because any inaccuracy is passed to downstream
fanout circuits. For this reason, the buffers B are
implemented as programmable delay elements such as
illustrated in FIG. 3 [illegible]
delay element is equal to the delay that itself produces plus
the accumulated delay of the upstream delay elements. The
output of the delay element 42 has a minimum delay value,
[illegible] maximum
delay value.

[illegible] system comprises
a phase locked loop or other type of phase comparator that
compares reference clock pulses having the required skew
with output pulses CLOCK' from the multiplexer 50 of each
buffer B. The comparator then generates and applies a
[illegible] SELECT signal to the multiplexer 50 of each buffer
B designating which multiplexer input (output of respective
delay element 42, 46 or [illegible]) to pass therethrough as
output pulses CLOCK'. The value of the SELECT signal
corresponds to the delay required to make the phase or skew
of the CLOCK' pulses coincide with that of the reference
pulses.

Although effective in equalizing the skew in an integrated
circuit having a hierarchial block structure, the arrangement
of FIGS. 1 to 3 is disadvantageous in that it requires
programmable delay buffers and phase comparison circuitry.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a
hierarchial clock distribution system and method for a
microelectronic integrated circuit that enables accurate
clock delay compensation using fixed delay buffers to
minimize the skew.

A microelectronic circuit includes a plurality of circuitry
blocks and sub-blocks, a clock driver, an electrical
interconnect that directly connects the clock driver to the sub-blocks,
and balanced clock-tree distribution systems provided
between the electrical interconnect and circuitry in the
sub-blocks respectively.

A method of providing a hierarchial clock distribution
system for the circuit includes determining clock delays
between the clock driver and the clocked circuit elements
within sub-blocks respectively. Delay buffers are selected
from a predetermined set of fixed delay buffers having the
same physical size and different delays, with the delay
buffers being selected to provide equal clock delay between
the clock driver and the distribution systems respectively.
Each delay buffer includes a delay line, and a number of
loading elements that are connected to the delay line, with
the number of loading elements being selected to provide the
required clock delay for the respective sub-block.

These and other features and advantages of the present
invention will be apparent to those skilled in the art from the
following detailed description taken together with the
accompanying drawings, in which like reference numerals
refer to like parts.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified diagram illustrating a prior art
system for clock skew equalization;

FIG. 2 is a simplified diagram illustrating how the system
of FIG. 1 can be extended to a hierarchial circuitry block
arrangement;

FIG. 3 is a diagram illustrating a programmable delay
buffer of the system of FIG. 2;

FIG. 4 is a diagram illustrating a microelectronic
integrated circuit including a hierarchial clock distribution
system embodying the present invention; and

FIG. 5 is an electrical schematic diagram illustrating a
fixed clock delay buffer of the system of FIG. 4.

DETAILED DESCRIPTION OF THE
INVENTION

A hierarchial clock distribution system embodying the
present invention is illustrated in FIG. 4 and generally
designated by the reference numeral 60. The system 60 is
implemented as part of a microelectronic integrated circuit
62 which typically receives clock pulses CLOCK from an
external source. However, it is within the scope of the
invention, although not explicitly illustrated, to provide a
clock pulse generator as part of the circuit 62 itself.

Balanced clock tree distribution is known in the art per se,
and the details thereof are not the particular subject matter
of the present invention. A basic treatise on this subject is
presented in an article entitled "AN IMPLEMENTATION
OF A CLOCK-TREE DISTRIBUTION SCHEME FOR
HIGH-PERFORMANCE ASICS", by A. Erdal et al, in
Proceedings of the Annual IEEE International ASIC
Conference [illegible], Rochester, N.Y., September 1992, pp.
26 to 29.

In general, balanced clock tree distribution is performed
using [illegible] the original clock net into
buffered sub-clock nets in a bottom-up fashion after
placement of the cells C, without the local buffers D. An initial
[illegible] of clocked cells C (cells having clock pins) is
obtained according to the spread of the cells C and the
[illegible] neighboring clocked cells C. Then, the
clocked cells C are exchanged among the groups to get the
optimum result.

The objective is to minimize both the maximum absolute
loading difference among the groups and the standard
deviation of the loading of the groups. After the grouping, an
appropriate number of balance cells (not shown) are added
to balance the loading for each group.

The location of each balance cell is calculated to balance
the area and spread of the group with respect to other groups.
The location of the local buffers D and E are calculated as
the optimum balance center, based on the estimated routing
pattern of the group to minimize the skew among the
clocked cells C. Finally, all the balance cells and buffers D
and E are inserted into the design and placed automatically
in layout, based on the calculated coordinates.

The present invention enables effective clock skew
compensation using fixed delay buffers B [illegible]
using the prior art arrangements illustrated in FIGS.
1, 2 and 3. The invention accomplishes this goal by
providing the delay buffers B at one sub-block level, thereby
eliminating accumulated hierarchial delay inaccuracies, and
performing block level skew compensation using the
balanced clock tree distribution systems.

The clock pulses CLOCK are applied to a clock driver 64,
which applies the clock pulses through an electrical
interconnect wiring 66 to microelectronic circuit modules or

As illustrated in FIG. 5, each delay buffer B has the same

7^72? 7? comprises sub-blocks « physical ^^^^ fadUtatepkcementin the circuit 62. 

^ ' ( Each buffer B comprises the same number, here illustrated 

The wiring 66 is connected to a clock delay buffer B in as four, of logic elements which provide known delays. As 

each block 68 and 70, and in each sub-block 72a, 72b and sn own m piG. 5, the logic elements are inverters 90, 92, 94 

72c. It will be noted that no buffer Bis provided in the block and 9^ although the invention is not so limited The 

72 between the wiring 66 and the sub-blocks 72a, 72b and inverters can be replaced by, for example, NOR gates or wire 

72^' delay lines, although not explicitly illustrated. 

-Although only two levels of hierarchy are illustrated in The inverters 90, 92, 94 and 96 are connected in a cascade 

FIG. 4, consisting of one block level and one sub-block or chain, such that the clock pulses CLOCK' are delayed by 

level, the invention is not so limited. A hierarchial structure ^ the sum of the delays provided by the individual inverters 

including any number of block/sub-block levels can be 90, 92, 94 and 96. For example, each inverter provides a 

provided in accordance with the present invention. delay of 0.25 ns, such mat the total delay provided by the 

However, a delay buffer B will be provided typically at the inverters 90, 92, 94 and 96 alone is 1.0 ns. 

first and second hierarchial block level. Each delay buffer B is capable of providing a delay 

The individual time delays provided by the delay buffers 55 ranging from 1.0 ns to, for example 3.0 ns, by variably 

B are selected to equalize the dock delay between the driver loading the outputs of the inverters 90, 92, 94 and 96 using 

64 and clocked circuit elements or cells C in each block loading elements 98. Each loading element 98 comprises, in 

Skew compensation within the blocks or sub-blocks 68, 70, the illustrated example, a PMOS field-effect transistor 98a 

72a, 72b and 72c is provided by balanced clock tree distri- and/or an NMOS field-effect transistor 984? that have their 

bution systems 74, 76, 78, 80 and 82 respectively. & gates connected to the output of the respective inverter. The 

The specific arrangement of each distribution system will source and drain of each transistor 98a is connected to a first 

depend on the circuitry in the respective block or sub-block. constant electrical potential source VDD, whereas the 

For simplicity of illustration, each balanced clock tree source and drain of each transistor 986 are connected to a 

distribution system 74. 76, 78, 80 and 82 is shown as second constant electrical potential source which is illus- 

comprising the clocked circuitry elements or cells C, and 65 trated as being ground. 

local buffers or drivers D and E that are connected between Each loading element 98 causes the delay of the respec- 

rhe buffers B and the cells C rive inverter 90, 92, 94 and 96 to be increased by, for 



04/21/2004, EAST Version: 1.4.1 



5,686, 

5 

example, 0.1 ns. A number of delay elements 98 from 0 to
5 can be connected to the output of each inverter 90, 92, 94
and 96, such that the total delay can be increased by 4
inverters × 5 loading elements/inverter × 0.1 ns delay/loading
element = 2.0 ns. Thus, the total maximum delay that can be
provided by each delay buffer B is 1.0 ns + 2.0 ns = 3.0 ns, and
the delay can be varied from 1.0 ns to 3.0 ns in 20 increments
of 0.1 ns/increment.
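
The arithmetic above can be checked with a short sketch. It is only an illustration of the example figures given in the text (four inverters of 0.25 ns each, up to five loading elements per inverter, 0.1 ns per loading element); the function and constant names are hypothetical and do not come from the patent.

```python
from fractions import Fraction

# Example values quoted in the description; the names are illustrative only.
INVERTERS = 4
MAX_LOADS_PER_INVERTER = 5
INVERTER_DELAY_NS = Fraction(1, 4)   # 0.25 ns per inverter
LOAD_DELAY_NS = Fraction(1, 10)      # 0.1 ns added per loading element


def buffer_delay_ns(total_loads: int) -> Fraction:
    """Delay of one buffer B carrying `total_loads` loading elements."""
    max_loads = INVERTERS * MAX_LOADS_PER_INVERTER
    if not 0 <= total_loads <= max_loads:
        raise ValueError(f"total_loads must be between 0 and {max_loads}")
    # Unloaded chain delay plus the extra delay of every loading element.
    return INVERTERS * INVERTER_DELAY_NS + total_loads * LOAD_DELAY_NS
```

Exact fractions are used so that the 0.1 ns steps add up without floating-point rounding.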

In configuring a particular buffer B, loading elements 98
are first provided at the output of the first inverter 90. If more
loading elements 98 are required, they are provided at the
outputs of the inverters 92, 94 and 96 in consecutive order.
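
One way to read this filling order is sketched below. It assumes that each inverter output is loaded to its maximum before the next inverter receives any loading elements; the text does not state this explicitly, so treat it as one interpretation. The function name and defaults are hypothetical.

```python
def distribute_loads(total_loads, inverters=4, max_per_inverter=5):
    """Return the number of loading elements on each inverter output.

    Assumption: the first inverter is filled to capacity before the
    second receives any elements, and so on in consecutive order.
    """
    if not 0 <= total_loads <= inverters * max_per_inverter:
        raise ValueError("total_loads is outside the buffer's capacity")
    counts = []
    remaining = total_loads
    for _ in range(inverters):
        placed = min(remaining, max_per_inverter)
        counts.append(placed)
        remaining -= placed
    return counts
```

With these assumptions, seven loading elements would be placed as five on inverter 90 and two on inverter 92.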

The design of the integrated circuit 62 is facilitated by
providing a circuit library set of 20 delay buffers B which
differ from each other only in that they have 20 different
numbers of loading elements 98 to provide the 20 different
values of delay respectively.

A timing analysis is performed on the circuit 62 to
determine the value of delay between the clock driver 64 and
each clocked cell C. This is accomplished by initially
assigning a minimum delay value (1.0 ns) to each delay buffer
B, and determining the delay at the input pin of clocked cell
C. The timing analysis can be advantageously performed
using, for example, the Timing Analyzer Release 2.2 which
is commercially available as part of the Concurrent MDE®
Design System (C-MDE® Design System) from LSI Logic
Corporation of Milpitas, Calif.

After the delay corresponding to each clocked cell C is
determined, the value of delay which the buffer B is required
to produce in order to equalize the delay at the inputs of all
of the clocked cells C is calculated, and one of the 20
possible buffer configurations is selected from the library set
which has the corresponding value of delay. The buffers B
are then inserted into the design and placed automatically in
layout, based on the required delay values.
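
The selection step can be sketched as follows. The sketch assumes that the timing analysis yields one clock delay per block or sub-block, measured with every buffer B at its 1.0 ns minimum, and that the slowest path sets the target. Delays are held as integer picoseconds, and the rounding rule and names are assumptions made for illustration, not details taken from the patent.

```python
from typing import Dict

MIN_BUFFER_PS = 1000   # 1.0 ns minimum buffer delay
MAX_BUFFER_PS = 3000   # 3.0 ns maximum buffer delay
STEP_PS = 100          # 0.1 ns per loading element


def select_buffer_delays(path_delay_ps: Dict[str, int]) -> Dict[str, int]:
    """Pick a buffer delay per block so that all clock delays line up.

    `path_delay_ps` maps a block name to its driver-to-cell delay as
    measured with the minimum-delay buffer in place (an assumed input).
    """
    target = max(path_delay_ps.values())
    chosen = {}
    for block, delay in path_delay_ps.items():
        extra = target - delay
        # Round to the nearest available 0.1 ns step (half rounds up).
        steps = (extra + STEP_PS // 2) // STEP_PS
        required = MIN_BUFFER_PS + steps * STEP_PS
        if required > MAX_BUFFER_PS:
            raise ValueError(f"block {block} needs more delay than a buffer provides")
        chosen[block] = required
    return chosen
```

For example, measured delays of 2400 ps, 2150 ps and 1900 ps would lead to buffer delays of 1000 ps, 1300 ps and 1500 ps respectively, leaving a residual mismatch of at most half a step.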

Various modifications will become possible for those 
skilled in the art after receiving the teachings of the present 
disclosure without departing from the scope thereof. 

For example, the numbers of delay elements and loading
elements in the buffers B as described above, as well as the
specific delays that they provide, are exemplary only, and
can be varied in any manner to accommodate a particular
application.

We claim:

1. A hierarchial clock distribution system for a microelec- 
tronic circuit including a plurality of circuitry blocks and 
sub-blocks, comprising: 

a clock driver;

delay buffers provided in the sub-blocks respectively;

an electrical interconnect that directly connects the clock
driver to the delay buffers; and

balanced clock-tree distribution systems provided
between the delay buffers and circuitry in the sub-blocks
respectively;

the delay buffers providing equal clock skews from the
clock driver to the distribution systems respectively.

2. A system as in claim 1, in which the delay buffers have
the same physical size.

3. A hierarchial clock distribution system for a microelec- 
tronic circuit including a plurality of circuitry blocks and 
sub-blocks, comprising: 

a clock generator;

delay buffers provided in the sub-blocks respectively;

an electrical interconnect that directly connects the clock
generator to the delay buffers; and

balanced clock-tree distribution systems provided
between the delay buffers and circuitry in the sub-blocks
respectively;

the delay buffers providing equal clock skews from the
clock generator to the distribution systems respectively,

wherein the delay buffers have the same physical size and
comprise identical delay lines that are loaded to equalize
said clock skews respectively.

4. A system as in claim 3, further comprising a clock
driver connected between said clock generator and said
electrical interconnect.

5. A system as in claim 3, further comprising a plurality
of loading elements, in which:

the delay lines comprise strings of logic elements; and

the loading elements are connected to outputs of the logic
elements.

6. A system as in claim 5, in which each logic element 
comprises an inverter. 

7. A system as in claim 5, in which each logic element 
comprises a NOR gate. 

8. A system as in claim 5, in which each logic element has
a number n of loading elements connected to the output
thereof, where 0≤n≤N, and N is a predetermined maximum
value.

9. A system as in claim 8, in which N=5, each of said
loading elements increases the delay by about 0.1 ns, and
there are four of said logic elements each having a delay of
about 0.25 ns, whereby the delay can be varied from about
1 ns to about 3 ns in 20 increments of about 0.1
ns/increment.

10. A clock delay buffer for a microelectronic circuit,
comprising:

a delay line comprising a string of logic elements; and 
a number of loading elements that are connected to the 
delay line, said number being selected to provide a 
predetermined clock delay, wherein said loading ele- 
ments comprise MOS-type field-effect transistors hav- 
ing gates connected to outputs of the logic elements, 
and sources and drains connected to a constant electri- 
cal potential. 

11. A buffer as in claim 10, in which each loading element 
comprises: 

a PMOS field-effect transistor having a gate connected to 
an output of one of the logic elements, and a source and 
a drain connected to a constant electrical potential. 

12. A buffer as in claim 10, in which each loading element 
comprises: 

an NMOS field-effect transistor having a gate connected to
an output of said one of said logic elements, and a
source and a drain connected to a constant electrical
potential.

13. A buffer as in claim 10, in which each logic element 
comprises an inverter. 

14. A buffer as in claim 10, in which each logic element 
comprises a NOR gate. 

15. A buffer as in claim 10, in which each logic element
has a number n of loading elements connected to the logic
element, where 0≤n≤N, and N is a predetermined maximum
value.

16. A buffer as in claim 15, in which N=5, each of said 
loading elements increases the delay by about 0.1 ns, and 
there are four of said logic elements each having a delay of 
about 0.25 ns, whereby the delay can be varied from about 
1 ns to about 3 ns in 20 increments of about 0.1 
ns/increment. 

* * * * *
ABSTRACT

A method and apparatus for inserting clock buffers to reduce
clock skew in a logic block. The proper placement of the cells
within the logic block is first determined. Given
this cell placement and the location of the local clock lines,
the placement of clock buffers within the logic block is
determined such that the clock buffers are in close proximity
to the local clock lines. Routing is then performed to connect
the clock buffers to their corresponding local clock lines and
the cells requiring clock signals to their corresponding clock
buffers. The performance of the logic block is then evaluated.
If the performance does not satisfy a predetermined
minimum threshold then the cells are modified to satisfy the
minimum threshold or come closer to attaining it. The clock
buffers are removed, and the proper placement of the new
cells within the logic block is determined. Given this new
cell placement, a new set of clock buffers is placed and a new
routing is created. The performance is then re-evaluated, and
if the minimum threshold still has not been attained, the
above process is repeated.

24 Claims, 9 Drawing Sheets 



[Representative drawing: logic block 405 with clock buffers 430,
buffer lines 435, latches 410, rows 415, channel 418, clock trunks
420 and 422, and reference numeral 437.]

U.S. Patent    Oct. 8, 1996    Sheets 1 to 9 of 9    5,564,022

[Sheet 1: FIG. 1A, Prior Art.]

[Sheet 2: FIG. 1B, Prior Art.]

[Sheet 3: FIG. 2, showing display device 205, cursor control
device 207, storage device 204 and signal generation device 208.]

[Sheet 4: FIG. 3, flowchart showing placement 310, chip plan 312,
insert buffers 315, buffer file 320, clock names 322, file 324,
routing 325, schematic with new buffers 330, performance
evaluation 335, finished schematic 340, schematic modification
345 and strip out clock buffers 350.]

[Sheet 6: FIG. 4B, showing logic block 405, clock buffers 430,
buffer lines 435, latches 410, clock trunks 420 and 422, and
reference numeral 437.]

[Sheet 7: FIG. 4C, showing logic block 405, routing lines 442,
clock buffer 430, buffer lines 435, rows 415, and clock trunks
420 and 422.]

[Sheet 9: FIG. 5B, showing logic block 405, latches 410, and
clock trunks 420 and 422.]

METHOD AND APPARATUS FOR 
AUTOMATICALLY INSERTING CLOCK 
BUFFERS INTO A LOGIC BLOCK TO 
REDUCE CLOCK SKEW 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention pertains to the field of microprocessor
architecture and layout. More particularly, this invention
relates to placing properly sized clock buffers in the
proper location within a logic block to reduce clock skew.

2. Background 

Components of an integrated circuit operate based on
timing and pulsing of clock signals which provide a reference
point or activation signal for circuit activity and
processing. The clock signals also provide a timing or
alignment reference which different circuits adopt when
stepping through their respective processing tasks. It is
important that the clocking signals be predictable and not
delayed such that processing and execution by circuit
components are accomplished in synchronization. Microprocessor
integrated circuit devices utilize a system clock which
provides timing and pulsing to drive the various elements
and processing of the microprocessor.

It is vital to the operation of a microprocessor that the
system clock be supplied uniformly to all components of the
microprocessor with minimal clock skew. Clock skew refers
to the variations in timing delays between the system clock
and a clock signal reaching a component. Resistance within
the clock line and capacitance on the clock line create RC
skews, a type of clock skew, as the clock signal propagates.
Clock buffers can be used to deskew the clock signal; thus,
a system for automatically placing the proper clock buffers
in the correct locations would be advantageous.
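
To make the RC skew idea concrete, the sketch below estimates wire delay with the standard Elmore approximation for a uniform RC line driving a lumped load. This model is a common textbook choice and is not described in the patent; the per-micron resistance and capacitance values in the usage note are purely hypothetical.

```python
import math


def elmore_delay_s(length_um, r_per_um, c_per_um, load_f):
    """Elmore delay in seconds of a uniform RC line with a lumped end load.

    length_um: wire length in microns
    r_per_um:  wire resistance in ohms per micron (assumed value)
    c_per_um:  wire capacitance in farads per micron (assumed value)
    load_f:    capacitance of the driven element in farads
    """
    r_total = r_per_um * length_um
    c_wire = c_per_um * length_um
    # Distributed wire capacitance counts half; the end load counts fully.
    return r_total * (c_wire / 2.0 + load_f)


def rc_skew_s(length_a_um, length_b_um, r_per_um, c_per_um, load_f):
    """Difference in arrival time between two sinks on unequal wires."""
    delay_a = elmore_delay_s(length_a_um, r_per_um, c_per_um, load_f)
    delay_b = elmore_delay_s(length_b_um, r_per_um, c_per_um, load_f)
    return abs(delay_a - delay_b)
```

Because the wire term grows with the square of the length, doubling the length of a narrow clock line more than doubles its delay, which is why unequal line lengths translate directly into skew.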

A similar problem is that of a minimum delay between
latches. A minimum delay problem may arise when the
signal from a source latch is input into a receiving latch. If
the clock signal driving the receiving latch reaches the
receiving latch after the signal from the source latch arrives,
the receiving latch may latch the wrong data. Thus, a buffer
may be inserted into the line between the two latches to
create a delay such that the signal does not arrive prior to the
clock signal.
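
The minimum delay condition can be expressed as a small check. The sketch assumes that arrival times at the receiving latch are known as integer picoseconds and ignores any hold-time margin, which the text does not discuss; the function name is hypothetical.

```python
def min_delay_padding_ps(data_arrival_ps, clock_arrival_ps):
    """Extra data-path delay needed so data does not beat the clock.

    Returns 0 when the data already arrives at or after the clock edge
    at the receiving latch.
    """
    return max(0, clock_arrival_ps - data_arrival_ps)
```

A buffer inserted between the two latches would need to contribute at least the returned amount of delay.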

Design techniques for microprocessors may include uti- 
lization of a large number of functional blocks in order to 
shorten the design cycle. The functional blocks consist of a 
varying number of cells and utilize clock buffers to prevent 
clock skew. As microprocessors use faster and faster clock 
speeds, variations in clock skew within the functional blocks 
become a major concern. The slower clock speeds used in
older microprocessor technology were slow enough that the 
clock skew within the functional blocks could be either 
ignored or resolved easily. However, faster clock speeds 
require that the problem of clock skew be addressed more 
efficiently. 

In addition, microprocessor development times have 
become shorter and shorter. Therefore, an automatic system 
for the designer to insert the properly sized clock buffer in 
the proper location would be advantageous. 

Thus, it would be advantageous to automatically opti- 
mally insert the proper clock buffers into the functional 
blocks. The present invention offers such a solution. 

An example prior art placement of clock buffers is shown
in FIG. 1A. The clock buffers 110 may have been placed
arbitrarily, or at the very least in a non-optimized manner.
That is, the clock buffers 110 were not guaranteed to be
placed close to the clock line and thereby reduce clock skew.
Additionally, as shown, the latches 120 were not necessarily
driven by the clock buffer 110 located closest to each latch
120.

Other prior art placements of clock buffers may have
attempted to solve the clock skew problem by placing clock
buffers close to the latches being driven, as shown in FIG.
1B. However, as is readily apparent in FIG. 1B, the placement
of the clock buffers 110 is not optimized because the
buffers 110 are not placed close to the clock line 100. The
extra distance between the clock line 100 and the buffers 110
over narrower lines 115 causes additional RC skew; thus, the
reduction of clock skew is not optimized.

SUMMARY AND OBJECTS OF THE 
INVENTION 

The present invention comprises a method and apparatus
for inserting local clock buffers in a logic block. The present
invention first determines the proper placement of the cells
within the logic block. Then, given the cell placement and
the location of the local clock trunks, the invention places
the clock buffers within the logic block in close proximity to
the local clock trunks. Routing is then performed to connect
the clock buffers to their corresponding local clock trunks and
the cells requiring a clock signal to their corresponding clock
buffers.

The performance of the block is then evaluated. If the
performance does not meet a predetermined minimum
threshold then the cells are modified to attain the minimum
threshold, or come closer to attaining it. The clock buffers
previously inserted are removed, and the proper placement
of the new cells within the logic block is determined. Then,
given this new cell placement and the location of the local
clock trunks, the invention places a new set of clock buffers
within the logic block in close proximity to the local clock
trunks. A new routing is then created to connect the clock
buffers to their corresponding local clock trunks and the cells to
their corresponding clock buffers.

The performance of this new block is then evaluated to
determine whether it satisfies the minimum
threshold. If it does, then the process is complete. However,
if the minimum threshold is not satisfied then the invention
repeats the above process of modifying the cells, removing
the clock buffers, placing the new cells, and reinserting a new
set of clock buffers. This process will be repeated until the
minimum threshold is satisfied.
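
The iteration summarized above can be sketched as a simple loop. The placement, buffer insertion, routing, evaluation and modification steps are passed in as callables because the text does not define their interfaces; all names here are hypothetical, and the iteration cap is an added safeguard that the source does not mention.

```python
def insert_clock_buffers(cells, place, insert_buffers, route, evaluate,
                         modify, threshold, max_iterations=10):
    """Repeat place / insert / route / evaluate until the threshold is met.

    Every callable is a stand-in for a design tool step; `evaluate`
    is assumed to return a score where higher is better.
    """
    for _ in range(max_iterations):
        placement = place(cells)
        buffers = insert_buffers(placement)
        routing = route(placement, buffers)
        if evaluate(placement, buffers, routing) >= threshold:
            return placement, buffers, routing
        # Threshold missed: modify the cells. The old buffers are simply
        # discarded, since a fresh set is inserted on the next pass.
        cells = modify(cells)
    raise RuntimeError("performance threshold not reached")
```

Discarding the buffers before re-placement mirrors the description, in which the previously inserted clock buffers are removed before the new cells are placed.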

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example 
and not limitation in the figures of the accompanying 
drawings, in which like references indicate similar elements 
and in which: 

FIG. 1A is a block diagram of a prior art placement of
clock buffers;

FIG. 1B is a block diagram of a prior art placement of
clock buffers;

FIG. 2 is a block diagram of a computer system used by
the preferred embodiment of the present invention;

FIG. 3 is a flow chart of the steps of the preferred
embodiment of the present invention;

FIG. 4A is a diagram showing the placement of cells
without clock buffers in the preferred embodiment of the
present invention;

FIG. 4B is a diagram showing the placement of cells and
clock buffers before routing in the preferred embodiment of
the present invention;

FIG. 4C is a diagram showing the placement of cells and
clock buffers after routing in the preferred embodiment of
the present invention;

FIG. 5A is a diagram showing the placement of cells after
having the clock buffers stripped in the preferred embodiment
of the present invention; and

FIG. 5B is a diagram showing a new placement of cells
and clock buffers in the preferred embodiment of the present
invention.

DETAILED DESCRIPTION

In the following detailed description of the present invention,
numerous specific details are set forth in order to
provide a thorough understanding of the present invention.
However, it will be obvious to one skilled in the art that the
present invention may be practiced without these specific
details. In other instances, well known methods, procedures,
components, and circuits have not been described in detail
so as not to unnecessarily obscure the present invention.

Some portions of the detailed descriptions which follow
are presented in terms of algorithms and symbolic
representations of operations on data bits within a computer
memory. These algorithmic descriptions and representations
are the means used by those skilled in the data processing
arts to most effectively convey the substance of their work
to others skilled in the art. An algorithm is here, and
generally, conceived to be a self-consistent sequence of steps
leading to a desired result. The steps are those requiring
physical manipulations of physical quantities. Usually,
though not necessarily, these quantities take the form of
electrical or magnetic signals capable of being stored,
transferred, combined, compared, and otherwise manipulated. It
has proven convenient at times, principally for reasons of
common usage, to refer to these signals as bits, values,
elements, symbols, characters, terms, numbers, or the like. It
should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate
physical quantities and are merely convenient labels applied
to these quantities. Unless specifically stated otherwise as
apparent from the following discussions, it is appreciated
that throughout the present invention, discussions utilizing
terms such as "processing" or "computing" or "calculating"
or "determining" or "displaying" or the like, refer to the
action and processes of a computer system, or similar
electronic computing device, that manipulates and transforms
data represented as physical (electronic) quantities
within the computer system's registers and memories into
other data similarly represented as physical quantities within
the computer system memories or registers or other such
information storage, transmission or display devices.

In general, computer systems used by the preferred
embodiment of the present invention are as illustrated in
block diagram format in FIG. 2, and comprise a bus 200 for
communicating information, a central processor 201
coupled with the bus for processing information and
instructions, a random access memory 202 coupled with the bus
200 for storing information and instructions for the central
processor 201, a read only memory 203 coupled with the bus
200 for storing static information and instructions for the
processor 201, a data storage device 204 such as a magnetic
disk and disk drive coupled with the bus 200 for storing
information (such as audio or voice data) and instructions, a
display device 205 coupled to the bus 200 for displaying
information to the computer user, an alphanumeric input
device 206 including alphanumeric and function keys
coupled to the bus 200 for communicating information and
command selections to the central processor 201, a cursor
control device 207 coupled to the bus for communicating
user input information and command selections to the
central processor 201, and a signal generating device 208
coupled to the bus 200 for communicating command
selections to the processor 201.

The display device 205 utilized with the computer system 
and the present invention may be a liquid crystal device, 
cathode ray tube, or other display device suitable for creat- 
ing graphic images and alphanumeric characters (and ideo- 
graphic character sets) recognizable to the user. The cursor 
control device 207 allows the computer user to dynamically 
signal the two dimensional movement of a visible symbol 
(pointer) on a display screen of the display device 205. 
Many implementations of the cursor control device are 
known in the art including a trackball, mouse, joystick or 
special keys on the alphanumeric input device 206 capable
of signaling movement of a given direction or manner of 
displacement. It is to be appreciated that the cursor means 
207 also may be directed and/or activated via input from the 
keyboard using special keys and key sequence commands. 
Alternatively, the cursor may be directed and/or activated 
via input from a number of specially adapted cursor direct- 
ing devices, including those uniquely developed for the 
disabled. In the discussions regarding cursor movement 
and/or activation within the preferred embodiment, it is to be 
assumed that the input cursor directing device or push button 
may consist of any of those described above and specifically 
is not limited to the mouse cursor device.

Logic blocks comprising a plurality of standard cells are 
well known in the art. In the preferred embodiment of the 
present invention, these cells are placed and the proper clock 
buffers to drive the latches (or any other cells requiring a 
clock signal) are automatically inserted; the block is then 
modified, if necessary, to satisfy performance requirements.

In the preferred embodiment, the cells are placed within 
the logic block in a double back row configuration. That is, 
cells are placed in pairs of rows 415 separated by channels 
418, as shown in FIG. 4A. However, it should be readily 
apparent to those of ordinary skill in the art that the present 
invention may be utilized in any of a wide variety of 
placement configurations. 

A flowchart of the method of the preferred embodiment of
the present invention is shown in FIG. 3. A schematic, or
netlist, is first created, step 305, which describes the
connectivity between the cells within the block. In a logic block
with multiple clock signals, the schematic also describes
which clock signal should be connected to, and therefore
drive, which latches. If a particular logic block utilized only
a single clock [...] each latch as being connected to the single
clock signal. The schematic also contains the loads of each
latch within the schematic, which is described in more detail
below.

The schematic defines the [...] and clock signals; however,
it does not describe placement of the cells in relation to one
another [...]. In the preferred embodiment of the present
invention, a netlist is used to describe the [...]. It should be
apparent to those of ordinary skill in the art [...].

The netlist from step 305 is input into a standard cell
placement system, step 310. The cell placement system
places [...] row pairs to optimize the routing and the area of
the [...]. The cell placement system places the cells to obtain
the best connectivity between the cells. In the preferred
embodiment of the present invention, [...] is used by the
placement system in step 310. It should be apparent to those
of ordinary skill in the art, however, that any standard cell
placement system [...].

In the preferred embodiment, an additional input to the
placement system is the chip plan 312. The chip plan 312 is
a description of the plan layout of the chip containing the
logic block being designed. One element of the chip plan
312 is a description of the local clock line(s) which drive the
cells within the logic blocks. [...] the chip plan
312 identifies to the placement system the location(s) of the
clock line(s) used by any particular logic block.

An example placement of cells is shown in FIG. 4A. 
MultipjeJatchesz410-aie3Hd^~pl 
dpuble.back .rows 415 in a logic block 405. Additional cells 



andjhejrjespe^ the 
dalKolnimeli^fl^ 

TABEE-F— ^ 
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The- presem-invj fttign-detjerrm 
numberof c lock b uffers to bejised based on the clock.buffers 
described irvclock -buffer -file^Oia^'the^lc^dlo-be^ven 
b)^^bufe(s)^thin:a rpw;panvJPhe load to be driven by 



_ _ the:buffer(s):is:based:on:theioad:of-the"latches"410-to^be 

Sil^ficlTd^^ 20 .cMven^d^eload 



channel 418, containing no cells, is locate d between each_ 

nwjrjair-415?0nly : two row pairF415 are shown m-FTGr4A-^^ 
^tojy £id-urjnecessa rily-cluttering^tiie,drawing v It should be 
Tipparent to those of or6Uhary~skill"in theart that the number 

of row pairs 415 within a block could vary among blocks, 25 

with the maximum limit being based on the size of the block 

405. 

I^dplexlccfctninl^ 
4Ar-Local clock:tnirjksr4201u^:4221nay-b^^ 

sainT^system )-clock-or b y separate clocks** The location of 30 trunks 420 and 422 and the placement of the cells in step 
meselrunks_420.and422 are proyia^d-by^exHip^ian^n, 310. 
describedtabove. 

Theplacement:systemr.step;310;of;HG^3,l)laces.the;cells 
withmrtherrow-pairsras'described-ato standard 
routmff-werei-peiformed^witho^ 35 



la tches~410.-rT hese loads were received as inputs to the 
placement:syst£mr310.-Triat is, the original netlist from step 
305 contains the joad s,ot each Jatch-contained :theiein.-This^ 
inforrnauoh~is~cMtamed in an auxiliaryfile along with the 
netlist in the preferred embodimentfThe-ldaa^f-thecline is 
deternriiiedzbased:on:&e:lengm:of:&e-Im 
. unit len gth -The ;load-per:uriU:leflgm:mav-be-input separately 
at-the^locJczbuffenins^rtioncStcp,^ the 
p!a(^men^ys^:along;with:the:latch:loarh of 
each line can be determined given the location of the clock 
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buffersr.the-latchesi would be c onnected : directly:to the clocle 3 
tr unks~over-clock.lines-412 , as>shown in FIG. 4A. Such a 
pl acement-crea teszazsubstanrialrclock-skew-problem^asrafe 
result of RG;Skew duejojthe rjarrownessiokclockTlines^^ 
and^e^differenc^inj^ ^° 
prior art placeme ^oLclock-buff ers~woiild"not-opdmallv 
s olve^m&Tclo"cirskew-proble m,-as:di^ 

After the placement of the cells is completed, the
clock buffers are inserted, step 315. The clock buffers are
inserted based on the placement of the cells determined in
step 310, the information in the clock buffer file 320, and the
information in the clock names file 322.


The clock names file 322 lists the names of the separate
clock signals which are utilized in the logic block 405. For
example, FIG. 4A shows a logic block 405 having two clock
trunks 420 and 422. Clock trunks 420 and 422 could carry
the same clock signal, e.g., [...]; or trunks 420 and 422 may
carry different clock signals, e.g., clock trunk 420 may
carry [...] and clock trunk 422 may carry [...]. The location
of these clock trunks is described in the chip plan 312. The
clock signal on each clock trunk is described in the clock
names file 322.

The clock buffer file 320 describes the
clock buffers available for use. A
wide variety of commercially available buffers exist which
could be utilized by the present invention. Each clock buffer
which could be used has a C_min and a C_max value associated
with it. The C_min value is the minimum load the
buffer is capable of driving, whereas C_max is the maximum
load the buffer is capable of driving. Thus, the clock buffer
file 320 contains a description of the clock buffers available.
It should be noted that in the preferred embodiment of the
present invention a minimum threshold exists below which
the load of a particular line need not be considered. The main
load being driven by the clock buffers
is the latches. Thus, if a line between a buffer and a clock
trunk or latch is short, its load is small
relative to the load of the latches and can safely be ignored.

In the currently preferred embodiment of the present 
invention, the load of a line may be safely ignored if the 
length is less than 150 microns. Furthermore, in the currently
preferred embodiment, the chip plan 312 has the clock
trunks 420 and 422 located no more than 300 microns apart.
Thus, the length of the lines connecting the clock buffers and 
the latches 410 may be safely ignored. 
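
The load bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the per-micron load passed in are hypothetical, and only the 150 micron threshold comes from the text.

```python
# Illustrative sketch; names and the per-unit-length load are assumptions.
IGNORE_BELOW_UM = 150.0  # line lengths under this are ignored (from the text)


def line_load_pf(length_um, load_per_um_pf):
    """Load of one line: length times load per unit length, or zero when
    the line is shorter than the minimum threshold."""
    if length_um < IGNORE_BELOW_UM:
        return 0.0
    return length_um * load_per_um_pf


def total_load_pf(latch_loads_pf, line_lengths_um, load_per_um_pf):
    """Total load seen by the buffer(s): latch loads plus non-negligible lines."""
    line_total = sum(line_load_pf(n, load_per_um_pf) for n in line_lengths_um)
    return sum(latch_loads_pf) + line_total
```

With trunks no more than 300 microns apart and buffers placed next to a trunk, most line lengths would fall under the threshold, so the total reduces to the sum of the latch loads.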

Given the loads to be driven by the clock buffer(s) and the
load each buffer is capable of driving, the present invention
compares these values and determines the size and
number of buffers required to drive the load. In the currently
preferred embodiment of the present invention, a single
buffer will be used to drive the load. If a single buffer large
enough to drive the load is not available in the clock buffer file
320 then an additional buffer(s) will be used. For example,
suppose the clock buffer file 320 contained two buffers,
BUF1 having a C_min of 0.566 pF and a C_max of 0.851 pF, and
BUF2 having a C_min of 0.851 pF and a C_max of 1.274 pF.
Further suppose that the load to be driven by the buffer(s)
was 1.900 pF. The present invention would compare these
values and determine that two buffers are required to drive
the load: a BUF1 and a BUF2.
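
The BUF1/BUF2 example can be reproduced with a short Python sketch. The selection rule used here is an assumption, not the patent's stated algorithm: it tries the smallest buffer count first and accepts a combination when the load lies between the summed C_min and the summed C_max values. The `Buffer` type, `select_buffers`, and the `max_count` cap are hypothetical names introduced for illustration.

```python
from itertools import combinations_with_replacement
from typing import NamedTuple


class Buffer(NamedTuple):
    name: str
    c_min: float  # minimum load the buffer can drive, in pF
    c_max: float  # maximum load the buffer can drive, in pF


def select_buffers(library, load_pf, max_count=4):
    """Return the names of the fewest buffers whose combined drive range
    covers load_pf, or None if no combination up to max_count fits.
    Assumption: drive ranges of parallel buffers simply add."""
    for count in range(1, max_count + 1):
        for combo in combinations_with_replacement(library, count):
            lowest = sum(b.c_min for b in combo)
            highest = sum(b.c_max for b in combo)
            if lowest <= load_pf <= highest:
                return [b.name for b in combo]
    return None


# Values taken from the example in the text.
LIBRARY = [Buffer("BUF1", 0.566, 0.851), Buffer("BUF2", 0.851, 1.274)]
print(select_buffers(LIBRARY, 1.900))
```

Trying single buffers before pairs mirrors the stated preference for a single buffer whenever one of sufficient size is available.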

The currently preferred embodiment utilizes a single
buffer to drive the load when a buffer of sufficient size is
available. However, it should be readily apparent to those of
ordinary skill in the art that multiple smaller buffers could be
used rather than a single larger buffer.

Having determined the size and number of clock buffers 
to use, the currently preferred embodiment of the present 
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invention places the buffers in the row pairs 415 and 
modifies the netlist such that the latches receive clock 
signals from a buffer in the row pair rather than directly from 
the clock trunks 420 or 422. The present invention places the 
clock buffers as close as possible to the clock trunk 420 or 
422 which is closest to the latch being driven. That is, the 
present invention places the clock buffers in the available 
position within the row pairs 415 which is closest to the 
clock trunks 420 and 422. Alternatively, the present inven- 
tion may place the clock buffers right below and very close 
to the clock trunks 420 and 422. Thus, the distance between 
the clock trunk and the clock buffer is minimized. The 
present invention modifies the netlist to include the clock 
buffers in these closest available positions to the clock 
trunks. 

The placement of clock buffers is performed within each 
row pair 415 for each clock trunk 420 and 422. Thus, each 
clock trunk 420 and 422 will have at least one clock buffer 
430 placed close to it in each row pair which has a cluster 
of latches (or single latch) to be driven. 
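
A minimal sketch of the "closest available position" rule follows. It assumes a one-dimensional model in which each row pair offers a list of candidate x coordinates and the clock trunk sits at a known x coordinate; the function and parameter names are hypothetical.

```python
def closest_free_slot(slot_xs, occupied, trunk_x):
    """Return the unoccupied slot nearest the clock trunk, or None when
    every slot in the row pair is already taken by a cell."""
    free = [x for x in slot_xs if x not in occupied]
    if not free:
        return None
    return min(free, key=lambda x: abs(x - trunk_x))
```

For example, with slots at 0, 10, 20, 30 and 40, slot 20 occupied, and the trunk at x = 22, the sketch picks slot 30. A `None` result corresponds to a full row pair, where no position is free for the buffer.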

FIG. 4B shows the placement of clock buffers 430 within
row pairs 415 of a logic block 405. Note that the buffer lines
435 will not actually be placed until routing step 325. Clock
trunks 420 and 422 are wider than buffer lines 435 will be,
therefore they cause a lower RC delay (and therefore result
in less clock skew). When the clock signals travel from a
clock trunk 420 or 422 to a clock buffer 430 over buffer line
435, the RC delay is greater because the lines 435 are
narrower. Thus, by placing the buffers 430 very close to the
clock trunks 420 and 422 the length of the lines 435 is
reduced, and the RC delay attributable to the lines 435 is
minimized.
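
The dependence of RC delay on line width and length can be illustrated with the usual first-order (Elmore) estimate for a distributed RC line, 0.5 x R x C, where the resistance per unit length is the sheet resistance divided by the width. The text gives no formula, so this model is an assumption, and the default sheet resistance and capacitance values below are placeholders, not figures from the text. Treating capacitance per unit length as independent of width is a further simplification.

```python
def wire_rc_delay_s(length_um, width_um,
                    sheet_res_ohm_per_sq=0.07, cap_per_um_f=2e-16):
    """First-order delay of a distributed RC line, in seconds.
    Delay grows with the square of length and falls as width increases."""
    res_per_um = sheet_res_ohm_per_sq / width_um   # ohms per micron
    total_res = res_per_um * length_um             # ohms
    total_cap = cap_per_um_f * length_um           # farads
    return 0.5 * total_res * total_cap
```

Under this model, halving the length of a narrow buffer line cuts its delay to one quarter, which is why placing the buffer next to the wide trunk is effective.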

In the currently preferred embodiment of the present 
invention the RC delay caused by the lines 435 is below the 
minimum threshold, discussed above. Thus, the load attrib- 
utable to the buffer wires 435 may be safely disregarded. The 
load attributable to the buffer wires 435 will be below the 
minimum threshold because the clock buffers are inserted 
close to the clock trunks. Note that under certain circum- 
stances the row pairs could be full of cells such that no place 
is available for a clock buffer. In this situation, the present 
invention may reposition the cells relative to the clock 
trunks in order to create a position for the clock buffer. This 
repositioning, however, could result in a cell being pushed 
outside of the functional block boundary. In such a situation, 
the placement of the cells, step 310, must be repeated, or 
alternatively the chip plan 312 must be modified to increase 
the size of the functional block. 

Note that in some instances multiple clock buffers 430
may be inserted into a row pair to drive multiple latches 410
within that row pair. In this situation the present invention
must determine which latches are driven by which buffers.
The preferred embodiment of the present invention resolves
this situation by determining the latches closest to each
buffer, as determined by the modified netlist. The present
invention will drive latches 410 with the buffer 430 closest
to each latch 410, limited by the loads being driven and the
C_min and C_max of each buffer, described above.
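
One way to picture this assignment is a greedy pass that gives each latch to the nearest buffer that still has drive capacity left. The sketch below is an assumption about how such a rule could be coded, not the patent's implementation: it works in one dimension, enforces only the maximum load of each buffer, and uses hypothetical names throughout.

```python
def assign_latches(latches, buffers):
    """Assign each latch to the nearest buffer with remaining capacity.

    latches: list of (latch_name, x_position, load_pf)
    buffers: list of (buffer_name, x_position, c_max_pf)
    Returns a dict mapping latch_name to buffer_name.
    """
    remaining = {name: c_max for name, _, c_max in buffers}
    position = {name: x for name, x, _ in buffers}
    assignment = {}
    for latch, x, load in latches:
        # Consider buffers from nearest to farthest.
        for buf in sorted(position, key=lambda b: abs(position[b] - x)):
            if remaining[buf] >= load:
                remaining[buf] -= load
                assignment[latch] = buf
                break
        else:
            raise ValueError(f"no buffer can drive latch {latch}")
    return assignment
```

A fuller version would also check that each buffer ends up driving at least its minimum load.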

A related situation is determining which buffer drives a
latch located between two clock trunks. In the preferred
embodiment of the present invention the buffer which is
closest to the latch in question will drive the latch. However,
it should be readily apparent to those of ordinary skill in the
art that other solutions may exist, such as having a buffer
with extra driving capacity drive the latch even though it
may not be closest.
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FIG. 4B shows a logic block 405 after insertion of the 
clock buffers 430. Dashed lines 437 show the modified 
connections between the latches and the clock buffers 430. 
As shown, the latches 410 are no longer directly connected 
to the clock trunks 420 and 422. The latches 410 receive 
clock inputs from the clock buffers 430, as described above, 
which in turn are connected to the clock trunks 420 and 422 
over buffer lines 435. 

It should be noted that the rare situation may occur in 
which a single latch is placed in a row pair having a load 
smaller than the smallest C_min in the clock buffer file 320. In
the preferred embodiment of the present invention, this
problem is resolved by repeating step 310. That is, the
placement system re-places the cells repeatedly until no such
single latch remains. Alternatively, the designer may manu- 
ally place an existing clock buffer of the appropriate size to 
drive that single latch. 

The preferred embodiment of the present invention
performs the proper routing after
completing the insertion of the clock buffers. The routing
system takes the modified netlist and determines the best
routing between the clock buffers 430 and the latches 410
each buffer is driving. Routing systems are well known in
the art and any commercially available routing system may
be utilized to perform the routing of step 325.

Upon completion of routing step 325, a new schematic is 
produced, step 330, having all cells and buffers placed and 
the routing completed. An example schematic is shown in 
FIG. 4C, showing the routing lines 442 placed in routing 
step 325. 

The new schematic is then input into an analysis system,
step 335, which performs a performance verification of the
logic block 405. In analysis step 335 the performance of the
block 405 is compared to a minimum performance threshold
defined by the system designer. This threshold is typically 
the minimum speed at which block 405 must run in order to 
function properly within the environment the block 405 is to 
be placed. Such analysis systems are well known in the art 
and thus will not be discussed further. 
Upon completion of the analysis step 335, one of two
actions may be taken. If the minimum performance threshold
was satisfied then the present invention outputs a finished
schematic, step 340. However, if the minimum threshold
is not satisfied, then the logic block 405 is optimized to
attain or approach the minimum performance threshold, step
345.

The analysis system which performs the verification
in step 335 is also capable of optimizing the
logic block 405. A wide variety of optimization techniques
exist, such as upsizing gates to make signals faster. Any of
a wide variety of optimization techniques known to those of
ordinary skill in the art may be utilized to attain or approach
the minimum timing threshold. It should be readily apparent
to those of ordinary skill in the art that although these 
optimization techniques are designed to attain or approach a 
minimum timing threshold, under certain circumstances a 
particular technique may modify cells such that timing is not 
improved. 

These optimization techniques, however, may affect the
placement of the cells. For example,
after modification the cells may be too large to fit within the
row pairs as they previously did. Thus, the cells in block 405
must be re-placed by the placement system, as done in step
310. Before this can be accomplished, however, the clock
buffers 430 which were inserted in step 315 must be
removed from the netlist because their placement will no
longer be optimal due to the re-placement to be performed
in step 310.

The clock buffers 430, previously placed in step 315, are
removed in step 350. This removal is automatic; no
user intervention is required. The present invention further
modifies the netlist from step 345 by eliminating the clock
buffers; the netlist is also modified such that the latches,
which previously received clock signals from the clock buffers,
now receive a clock signal directly from clock trunk 420
or 422, as shown in FIG. 5A. The dashed lines 512 show
which signals drive which latches, however the actual routing
has not occurred yet. It should also be noted that the
location of the cells shown in FIG. 5A has no significance as
the cells have not been placed yet.

After removal of the clock buffers, the netlist, without
the buffers, is again input into the placement system, step
310. The placement system, described above, will re-place
the modified cells as described above. The present invention
will then repeat the steps through step 325, as described above,
producing a new schematic at step 330. An example of a new
schematic is shown in FIG. 5B.

The timing analysis, step 335, will be performed again. As
described above, if the new schematic satisfies the performance
requirements then the present invention will output a
finished schematic, step 340. If the performance requirements
are still not satisfied, the cells will again be modified,
the clock buffers removed, and the above process repeated.
The above process will be repeated until the performance
requirements are satisfied.
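
The overall iteration can be summarized as a loop. The sketch below is schematic only: each stage is passed in as a callable because the text relies on commercially available placement, routing and analysis systems, and every name here (including the `max_iterations` guard, which the text does not mention) is a hypothetical stand-in.

```python
def run_flow(netlist, place, insert_buffers, route, meets_timing,
             optimize, remove_buffers, max_iterations=10):
    """Repeat placement, buffer insertion, routing and analysis until the
    performance threshold is met, then return the finished schematic."""
    for _ in range(max_iterations):
        placed = place(netlist)              # step 310
        buffered = insert_buffers(placed)    # step 315
        schematic = route(buffered)          # steps 325 and 330
        if meets_timing(schematic):          # step 335
            return schematic                 # step 340
        # Optimize the cells, then strip the clock buffers so that the
        # modified cells can be placed again on the next pass.
        netlist = remove_buffers(optimize(buffered))
    raise RuntimeError("performance threshold not met")
```

The buffers are removed before re-placement because their earlier positions would no longer be the closest available positions once the cells move.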

In an alternate embodiment, an additional buffer may be
inserted between two latches after the performance verification
of step 335. The performance verification may determine
that a minimum delay problem exists, in which the signal
from one latch arrives at another latch too early relative
to the arrival of the clock signal. If this minimum delay
problem is the only remaining problem, it is preferable
not to place the cells again. Thus, a buffer may be inserted
after the performance verification to delay the signal and
avoid latching [...].

Thus, the buffer is inserted [...]. The
buffer to insert is input at step 315 and is selected based on
the minimum delay required, as determined by performance
verification, step 335. That is, the proper size buffer will
generate a delay greater than the minimum delay required.
A minimum delay file 324 contains the minimum delays
which must be resolved and the latch or elements each
minimum delay is associated with, and contains buffers to
solve each minimum delay problem.
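
Selecting a buffer for a minimum delay problem can be sketched as a lookup. The text requires only that the chosen buffer generate a delay greater than the minimum delay required; picking the smallest such delay is an added assumption, and the function name, buffer names and delay values are hypothetical.

```python
def pick_delay_buffer(buffer_delays_ns, required_ns):
    """Return the name of the buffer with the smallest delay that still
    exceeds required_ns, or None if no buffer is slow enough."""
    candidates = {name: delay for name, delay in buffer_delays_ns.items()
                  if delay > required_ns}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)
```

For instance, given hypothetical delays of 0.2, 0.5 and 0.9 ns and a required minimum delay of 0.3 ns, the sketch selects the 0.5 ns buffer.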

Given the size of the buffer needed and the latch which
has the minimum delay problem, the present invention
inserts the proper buffer [...]. Because the buffer resolves a
minimum delay problem between the two latches, it may be
placed anywhere along the signal path from the source latch
to the receiving latch. Accordingly, the buffer is inserted [...]
path.

Routing, step 325, is then repeated. The placement of
existing cells was not modified, thus the routing changes only
minimally. Upon completion of the routing a new schematic
is generated, step 330, and the performance verification is
repeated, step 335. The insertion of the buffers to solve the
minimum delay problem [...]
timing [...], thus a finished schematic is produced,
step 340.

It should be noted that if another minimum delay problem
arises, steps 315 through 340 may be repeated at a future
date. In such a situation, a buffer which resolves the minimum
delay is inserted, step 315, routing takes place as
discussed above, and a new finished schematic is produced,
step 340.

The preferred embodiment of the present invention, a 
method and apparatus for inserting clock buffers, is thus 
described. While the present invention has been described in 
particular embodiments, it should be appreciated that the 
present invention should not be construed as limited by such 
embodiments, but rather construed according to the below 
claims. 

What is claimed is: 

1. A method for automatically reducing clock skew in a 
logic block having a plurality of cells, the method compris- 
ing the steps of: 

(a) determining a placement of said plurality of cells 
within said logic block; 

(b) determining a placement of a plurality of clock buffers 
within said logic block such that each clock buffer of 
said plurality of clock buffers is located in close prox- 
imity to a clock line; and 

(c) determining a routing between said plurality of clock
buffers, said plurality of cells and said clock line. 

2. A method for reducing clock skew as claimed in claim
1 further comprising the step of:

(d) determining the performance of said logic block. 

3. A method for reducing clock skew as claimed in claim
2 further comprising the steps of:

(e) removing said plurality of clock buffers from said 
logic block if said performance is below a predeter- 
mined minimum threshold; 

(f) modifying at least one cell of said plurality of cells if 
said performance is below said minimum threshold; 
and 

(g) repeating steps (a) through (g) if said performance is 
below said minimum threshold. 

4. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining a placement of a plurality
of clock buffers comprises locating each clock buffer of said 
plurality of clock buffers in a closest available position to a 
clock line. 

5. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining a placement of said 
plurality of cells comprises receiving a description of con- 
nectivity between a plurality of cells. 

6. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining a placement of a plurality 
of clock buffers further comprises receiving a description of 
the location of said clock line. 

7. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining a placement of said
plurality of cells comprises placing said cells within a 
plurality of rows. 

8. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining the placement of a 
plurality of clock buffers further comprises placing a first set 
of clock buffers of a first clock buffer type corresponding to 
a first clock and placing a second set of clock buffers of a 
second clock buffer type corresponding to a second clock. 

9. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining the placement of said
plurality of clock buffers further comprises placing a first set 
of clock buffers corresponding to a first clock line and 
placing a second clock buffer corresponding to a second 
clock line. 

10. A method for reducing clock skew as claimed in claim 
1 wherein said step of determining said routing comprises 
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coupling each buffer of said plurality of buffers to said clock 
line and coupling each cell of said plurality of cells requiring 
a clock signal to a buffer of said plurality of buffers such that 
no cell requiring a clock signal is directly coupled to said 
clock line.

11. A method for automatically inserting buffers into a
logic block having a plurality of cells, the method compris- 
ing the steps of: 

(a) determining a placement of a plurality of cells within 
said logic block;

(b) determining a placement of a plurality of buffers 
within said logic block; 

(c) determining a routing between said plurality of buffers,
said plurality of cells and said clock line;

(d) determining the performance of said logic block;

(e) removing said plurality of buffers from said logic 
block if said performance is below a predetermined 
minimum threshold; 

(f) modifying a cell of said plurality of cells if said
performance is below said minimum threshold; and 

(g) repeating steps (a) through (g) if said performance is 
below said minimum threshold. 

12. The method of claim 11 wherein said step (b) comprises
determining a placement of a plurality of buffers
within said logic block such that each clock buffer of said 
plurality of buffers is located in close proximity to a clock 
line. 

13. The method of claim 11 wherein said step (b) comprises
locating each buffer of said plurality of buffers in the
closest available position to a clock line. 

14. The method of claim 11 wherein said step (a) com- 
prises placing said cells within a plurality of rows. 

15. The method of claim 11 wherein said step (b) comprises
placing a first buffer of a first buffer type and placing
a second buffer of a second buffer type. 

16. The method of claim 11 wherein said step (b) com- 
prises placing a first buffer corresponding to a first clock line 
and placing a second clock buffer corresponding to a second 
clock line. 

17. The method of claim 11 wherein said step (c) com- 
prises coupling each buffer of said plurality of buffers to said 
clock line and coupling each cell of said plurality of cells 
requiring a clock signal to a buffer of said plurality of buffers 
such that no cell requiring a clock signal is directly coupled
to said clock line. 

18. The method of claim 11 wherein said step (g) com- 
prises determining a placement of a buffer and repeating 
steps (c) through (g). 

19. An apparatus for inserting clock buffers into a logic
block having a plurality of cells to reduce clock skew, the 
apparatus comprising: 



a bus;

a memory device which stores a set of available clock 
buffers, wherein the memory device is coupled to the 
bus; and 

a processor, coupled to the bus, for 
determining a placement of a plurality of cells within 
said logic block, 

determining a placement of a plurality of clock buffers 
selected from the set of available clock buffers within 
said logic block such that each clock buffer of said 
plurality of clock buffers is located in close proximity 
to a clock line, 

determining the routing between said plurality of clock 
buffers, said plurality of cells and said clock line, 

determining the performance of said logic block, 

removing said plurality of clock buffers from said logic 
block if said performance is below a predetermined 
minimum threshold, and 

modifying a cell of said plurality of cells if said perfor- 
mance is below said minimum threshold. 

20. An apparatus for inserting clock buffers as claimed in 
claim 19 wherein said processor for determining a placement
of a plurality of clock buffers is also for locating each
clock buffer of said plurality of clock buffers in a closest 
available position to a clock line. 

21. An apparatus for inserting clock buffers as claimed in 
claim 19 wherein said processor for determining the place- 
ment of said plurality of cells places said cells within a 
plurality of rows. 

22. An apparatus for inserting clock buffers as claimed in 
claim 19 wherein said processor for determining the place- 
ment of said plurality of clock buffers is also for placing a 
first set of clock buffers of a first clock buffer type corre- 
sponding to a first clock and placing a second set of clock 
buffers of a second clock buffer type corresponding to a 
second clock. 

23. An apparatus for inserting clock buffers as claimed in 
claim 19 wherein said processor for determining the place- 
ment of said plurality of clock buffers is also for placing a 
first set of clock buffers corresponding to a first clock line 
and placing a second clock buffer corresponding to a second 
clock line. 

24. An apparatus for inserting clock buffers as claimed in 
claim 19 wherein said processor for determining said routing 
is also for coupling each buffer of said plurality of buffers to 
said clock line and coupling each cell of said plurality of 
cells which requires a clock signal to a buffer of said 
plurality of buffers such that no cell requiring a clock signal 
is directly coupled to said clock line. 
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